Instructions

Technical analysis (R code), 25%. First attempt due on May 1 Sunday, the same time as as your team’s video presentation. Provide an R-markdown, or R script file, which shows the R-code and brief explanations as well as the rationale of the EDA and models you used. This submission, together with your final document (in part VI), will determine your team’s grade. This document shows a technical person the math/stat/codes that you used in your analysis. It should include:

  • Codes of any models that you built
  • Graph and chart outputs with your R codes
  • [When applicable] Codes of any statistical tests you used
  • Model evaluation(s) and comparison

Topic Proposal

The results of our midterm project indicated that there was a relationship between income inequality and mental health, which varied geographically across regions of the U.S. In this next stage, we intend to use statistical models to better understand the strength of the relationship.

We plan to build a linear regression model that assesses the strength of this relationship. Our independent variable for measuring income inequality is the “ratio of household income at the 80th percentile to income at the 20th percentile.” We have two alternative measures of mental health that can be used as dependent variables: the rate of frequent mental distress and the average number of poor mental health days reported in the past 30 days. We also plan to include additional variables that are expected to affect mental health, including median household income, the percentage of households with severe housing problems, the unemployment rate, the child poverty rate, and county-level demographic information. Additionally, regression tree and random forest models will be used to determine variable importance and better predict the prevalence of mental health issues.

This discussion provokes the following SMART questions:

  • Is the relationship between mental health and income inequality across U.S. counties robust when including other economic variables?
  • Does the relationship differ across regions of the U.S.?
  • Does the relationship depend on the measure used for mental health?

Description of the Dataset

The primary dataset used for this project is a combination of annual datasets called the County Health Rankings and Roadmaps National Data and was obtained the Robert Wood Johnson Foundation (RWJF). The County Health Rankings and Roadmaps National Data consists of county-level socioeconomic and public health data. The reports published from 2016 to 2021 provided the most complete data for all the variables of interest. The annual datasets before 2016 were incomplete or did not have variables that exactly corresponded to variables in 2016 to 2021. In general, the years in the datasets reflect the year the report was released by RWJF, but the underlying source data are from a one or two calendar years earlier.

For the exploratory data analysis, the data was also separated into four regions defined by the U.S. Census Bureau (U.S. Census Bureau, 2010).

The variables in the data set are:

  • region: name of the US Census Bureau region (name)
  • division: name of the US Census Bureau division (contained with a census region)
  • state: two letter state abbreviation
  • statecode: FIPS state code
  • countycode: FIPS county code
  • fipscode: 5-digit FIPS Code (county-level); combines statecode and countycode
  • county: county name
  • year: report release year from County Health Rankings; range of 2016-2021
  • county_ranked: Indicates whether or not the county was ranked; 0=unranked, 1=ranked, or NA for aggregated national or state-level data
  • mental_health_days: Average number of mentally unhealthy days reported in past 30 days (age-adjusted)
  • mental_distress_rate: Percentage of adults reporting 14 or more days of poor mental health per month
  • inequality: Ratio of household income at the 80th percentile to income at the 20th percentile (Income inequality)
  • median_inc: The income where half of households in a county earn more and half of households earn less
  • hs_grad: Percentage of adults ages 25 and over with a high school diploma or equivalent
  • college: Percentage of adults ages 25-44 with some post-secondary education
  • unempl: Percentage of population ages 16 and older unemployed but seeking work
  • child_poverty: Percentage of people under age 18 in poverty
  • single_parent: Percentage of children that live in a household headed by single parent
  • severe_housing: Percentage of households with severe housing problems
  • food_index: Index of factors that contribute to a healthy food environment, from 0 (worst) to 10 (best)
  • mh_providers: rate of providers to 100,000 population
  • pop_provider_ratio: ratio of population to mental health providers (i.e., population served per provider)
  • pop: census population estimate
  • pct_below18: percent of population younger than 18
  • pct_black: percent of population that are African-American or non-Hispanic Black
  • pct_native_am: percent of population that are Native American or Alaska Natives
  • pct_asian: percent of population that are Asian
  • pct_pacific: percent of population that are Native Hawaiian or Other Pacific Islander
  • pct_hispanic: percent of population that are Hispanic
  • pct_white: percent of population that are non-Hispanic white or Caucasian
  • pct_female: percent of population that are female
  • pct_rural: percent of population that live in rural areas
# look at county_ranked var; not all counties are ranked; also some aggregated data per state and country exist in the observations
# =1 means they are ranked, =0 means unranked, and =NA is for state/national data
#print(summary(dframe$county_ranked))

# subset of dataframe including only ranked counties
ranked <- dframe %>% subset(county_ranked==1)

# subset of dataframe including only ranked counties
unranked <- dframe %>% subset(county_ranked==0)

# subset of dataframe including only aggregated data
aggregated <- dframe %>% subset(is.na(county_ranked))

# duplicate column and rename level labels for easier reading
ranked$region_abb <- ranked$region
levels(ranked$region_abb) <- c("", 
                              "MW",  # re-level factor labels
                              "NE",
                              "S", 
                              "W")

# subset ranked data by region
ranked_MW <- ranked %>% subset(region=="Midwest")
ranked_NE <- ranked %>% subset(region=="Northeast")
ranked_SO <- ranked %>% subset(region=="South")
ranked_WE <- ranked %>% subset(region=="West")

# subset ranked data into annual datasets
ranked16 <- ranked %>% subset(year==2016)
ranked17 <- ranked %>% subset(year==2017)
ranked18 <- ranked %>% subset(year==2018)
ranked19 <- ranked %>% subset(year==2019)
ranked20 <- ranked %>% subset(year==2020)
ranked21 <- ranked %>% subset(year==2021)

# sort dataframe
ranked <- ranked[order(ranked$year, ranked$region, ranked$division, ranked$statecode, ranked$countycode), ]

Our data are identified at the county-year level. Our data set contains 19164 observations and 32 variables, although 319 of these observations are for aggregated data at the national or state level. In addition, 375 observations are for counties that are unranked by RWJF, suggesting that the data for these counties is less reliable. In total, we have 18470 observations in the ranked data, combined across 6 annual reports (2016–2021). Each annual data set has between 3071 and 3084 observations, indicating that the number of ranked counties is consistent over time.

Exploratory Data Analysis

Descriptive Statistics

#Summary Statistics of all Variables used in dframe dataset
# use ranked instead
table1<-describeBy(ranked, type = 1) ## type of kurtosis and skewness to calculate
table1 %>%
  kbl(caption="Summary Statistics for Master Dataset",
       format= "html", col.names = c("Var Num.","Count","Mean","Std. Dev.","Median", "Trimmed Mean", "Mad", "Minimum", "Maximum","Range", "Skewness", "Kurtosis", "S.E."),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Summary Statistics for Master Dataset
Var Num. Count Mean Std. Dev. Median Trimmed Mean Mad Minimum Maximum Range Skewness Kurtosis S.E.
region* 1 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008
division* 2 18470 6.50e+00 2.88e+00 8.00e+00 6.63e+00 2.96e+00 2.000 1.00e+01 8.00e+00 -0.379 -1.413 0.021
state* 3 18470 2.70e+01 1.42e+01 2.60e+01 2.73e+01 1.78e+01 1.000 5.10e+01 5.00e+01 -0.028 -1.205 0.105
statecode* 4 18470 2.82e+01 1.43e+01 2.70e+01 2.85e+01 1.78e+01 2.000 5.20e+01 5.00e+01 -0.049 -1.214 0.105
countycode* 5 18470 6.55e+01 5.82e+01 5.30e+01 5.61e+01 4.45e+01 2.000 3.29e+02 3.27e+02 1.949 4.759 0.428
fipscode* 6 18469 1.60e+03 9.23e+02 1.59e+03 1.60e+03 1.18e+03 3.000 3.20e+03 3.20e+03 0.010 -1.204 6.793
county* 7 18470 9.43e+02 5.31e+02 9.45e+02 9.40e+02 6.61e+02 1.000 1.89e+03 1.89e+03 0.046 -1.123 3.908
year* 8 18470 3.50e+00 1.71e+00 4.00e+00 3.50e+00 1.48e+00 1.000 6.00e+00 5.00e+00 -0.002 -1.268 0.013
county_ranked* 9 18470 2.00e+00 0.00e+00 2.00e+00 2.00e+00 0.00e+00 2.000 2.00e+00 0.00e+00 NaN NaN 0.000
mental_health_days 10 18469 4.04e+00 6.92e-01 4.02e+00 4.03e+00 7.06e-01 2.100 7.29e+00 5.19e+00 0.223 -0.112 0.005
mental_distress_rate 11 18469 1.26e-01 2.40e-02 1.24e-01 1.25e-01 2.40e-02 0.066 2.47e-01 1.81e-01 0.532 0.302 0.000
inequality 12 18465 4.52e+00 7.33e-01 4.41e+00 4.45e+00 6.28e-01 2.543 1.20e+01 9.43e+00 1.254 3.433 0.005
median_inc 13 18469 5.08e+04 1.36e+04 4.87e+04 4.93e+04 1.10e+04 21658.000 1.52e+05 1.30e+05 1.424 3.681 99.952
hs_grad 14 16563 8.70e-01 7.90e-02 8.83e-01 8.78e-01 6.70e-02 0.025 1.00e+00 9.75e-01 -1.593 6.583 0.001
college 15 18469 5.71e-01 1.16e-01 5.72e-01 5.72e-01 1.22e-01 0.152 9.11e-01 7.59e-01 -0.088 -0.263 0.001
unempl 16 18469 5.00e-02 2.00e-02 4.60e-02 4.80e-02 1.70e-02 0.012 2.40e-01 2.28e-01 1.690 6.492 0.000
child_poverty 17 18469 2.21e-01 9.10e-02 2.09e-01 2.14e-01 9.00e-02 0.024 7.47e-01 7.23e-01 0.695 0.537 0.001
single_parent 18 18469 3.14e-01 1.06e-01 3.06e-01 3.08e-01 9.60e-02 0.000 8.72e-01 8.72e-01 0.696 1.251 0.001
severe_housing 19 18470 1.42e-01 4.70e-02 1.37e-01 1.39e-01 3.70e-02 0.022 7.13e-01 6.91e-01 2.086 14.166 0.000
food_index 20 18394 7.32e+00 1.18e+00 7.50e+00 7.44e+00 1.04e+00 0.000 1.00e+01 1.00e+01 -1.386 3.743 0.009
mh_providers 21 17028 1.00e-03 2.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 2.40e-02 2.40e-02 3.457 24.730 0.000
pop_provider_ratio 22 17028 2.00e+03 2.84e+03 9.90e+02 1.38e+03 8.97e+02 -957.000 5.49e+04 5.58e+04 4.291 37.897 21.797
pop 23 18469 1.05e+05 3.34e+05 2.67e+04 4.40e+04 2.79e+04 810.000 1.02e+07 1.02e+07 13.652 310.896 2457.693
pct_below18 24 18469 2.23e-01 3.40e-02 2.22e-01 2.22e-01 2.80e-02 0.051 4.20e-01 3.69e-01 0.500 2.410 0.000
pct_black 25 18469 9.10e-02 1.44e-01 2.30e-02 5.70e-02 2.80e-02 0.000 8.59e-01 8.59e-01 2.265 5.074 0.001
pct_native_am 26 18469 2.30e-02 7.70e-02 6.00e-03 8.00e-03 5.00e-03 0.000 9.31e-01 9.31e-01 7.643 66.364 0.001
pct_asian 27 18469 1.50e-02 2.80e-02 7.00e-03 9.00e-03 5.00e-03 0.000 4.30e-01 4.30e-01 6.880 66.735 0.000
pct_pacific 28 18469 1.00e-03 4.00e-03 1.00e-03 1.00e-03 1.00e-03 0.000 1.31e-01 1.31e-01 21.327 545.017 0.000
pct_hispanic 29 18469 9.40e-02 1.37e-01 4.20e-02 6.10e-02 3.60e-02 0.004 9.64e-01 9.59e-01 3.100 11.051 0.001
pct_white 30 18469 7.63e-01 2.01e-01 8.36e-01 7.94e-01 1.60e-01 0.027 9.81e-01 9.55e-01 -1.185 0.796 0.001
pct_female 31 18469 4.99e-01 2.20e-02 5.03e-01 5.02e-01 1.10e-02 0.265 5.70e-01 3.05e-01 -3.180 17.262 0.000
pct_rural 32 18447 5.78e-01 3.12e-01 5.87e-01 5.90e-01 3.82e-01 0.000 1.00e+00 1.00e+00 -0.131 -1.121 0.002
region_abb* 33 18470 3.40e+00 1.09e+00 4.00e+00 3.37e+00 1.48e+00 2.000 5.00e+00 3.00e+00 -0.198 -1.396 0.008

The summary statistics show that we have nearly complete data for our most relevant quantitative variables – namely, the measures of mental health, income inequality, and median household income. Several other important economic variables, such as child_poverty, single_parent, and severe_housing, have nearly complete data too.

Our intended dependent variables, mental_health_days and mental_distress_rate, each have skewness and kurtosis below 1, which indicates those variables have relatively normal distributions. The mean and median for mentally unhealthy days in the past 30 days are similar at 4.04 and 4.017, with a standard deviation of 0.692. The mean mental_distress_rate is 0.126 with a standard deviation of 0.024. The median is slightly below the mean.

The economic variables, particularly inequality, have greater skewness and kurtosis, suggesting their distributions are less symmetrical and have larger tails. inequality has a mean of 4.524, a median of 4.413, and a standard deviation of 0.733.

Scatterplots

We also plotted the mental health measures vs. inequality by region. The results are striking. The Midwest and the South both exhibit a moderate correlation between the variables. The West has a weaker, but still positive, correlation. And in the Northeast, mental health and inequality do not seem to be correlated.

# data: use ranked_MW, ranked_NE, ranked_SO, ranked_WE

rgb_colors <- c("#A27BB8", "#006994", "#B52E1F", "#00873E")

#-- mental_health_days --#

p1 <- ggplot(ranked_MW, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_health_days, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = 'Days per 30', x = 'Income Inequality Rate')

p_regions_1 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Poor Mental Health Days vs. Income Inequality by Region")
p_regions_1

#-- mental_distress_rate --#

p1 <- ggplot(ranked_MW, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[1]) + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") + 
  labs(title = paste("(a) Midwest"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p2 <- ggplot(ranked_NE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[2]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(b) Northeast"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p3 <- ggplot(ranked_SO, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[3]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(c) South"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p4 <- ggplot(ranked_WE, aes(y=mental_distress_rate, x=inequality)) + 
  geom_point(show.legend = F, alpha = .7, color = rgb_colors[4]) +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE, color="#a89968") +
  labs(title = paste("(d) West"),
       y = '% Frequent Distress', x = 'Income Inequality Rate')

p_regions_2 <- (p1 + p2)/(p3 + p4) + plot_annotation(title = "Scatterplot of Frequent Mental Distress Rate vs. Income Inequality by Region")
p_regions_2

At this point in the EDA, it became clear that geographic differences in the relationship between mental health and inequality were important. These results justified our decision to focus on the differences by region. Our next step in exploration was to use boxplots to get a better sense of how the data varied across regional categories.

Boxplots

To better show the relationship between the economic variables and mental health in different regions of the United States, we generated boxplots of the data by region. Analyzing the economic inequality and mental health of people by region can more clearly assess whether the relationship varies by geography than direct analysis of the whole country.

# Characterization of median household income of four regions
b1 <- ggplot(ranked, aes(x=region_abb, y=median_inc)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Median Income", x="region")

# Characterization of Income inequality of four regions
b2 <- ggplot(ranked, aes(x=region_abb, y=inequality)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Inequality", x="region")

# Characterization of poor mental health days of four regions
b3 <- ggplot(ranked, aes(x=region_abb, y=mental_health_days)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Poor Mental Health Days", x="region")

# Characterization of the Frequent mental distress rate of four regions
b4 <- ggplot(ranked, aes(x=region_abb, y=mental_distress_rate)) + 
  geom_boxplot() + 
  geom_boxplot(colour="orange",fill="#7777cc",outlier.colour="red",outlier.shape=8, outlier.size=4) +
  labs(title="Frequent Mental Distress", x="region")

boxplot_a <- grid.arrange(b1,b2,b3,b4, nrow=2, ncol=2) #, top = text_grob("Boxplots of Key Variables by Region", color = "black", face = "bold", size = 14))

Overall, what information can we conclude from these boxplots?

  1. In the four U.S. regions, the Northeast has the highest median household income and the South has the lowest.
  2. Income inequality is highest in the South.
  3. Residents in the South also have the worst mental health, which correlates with their relatively low household income and higher levels of inequality.

As a whole, there seems to be real differences across regions in these key variables. The boxplots also indicate that these variables have outliers. As a result, we should look into the normality of the data before hypothesis testing.

Normality

We also graphed histograms of many of the numeric variables to consider their distributions. Although none are perfectly “normal” distributions, many do seem to approximate normality. Most importantly, mental_distress_rate and mental_health_days have distributions that somewhat resemble normality. Generally, the economic variables – such as inequality, median_inc, and unempl – are right-skewed.

# variables to plot
keep_var <- c('inequality', 'unempl', 'child_poverty', 'hs_grad', 'college', 'single_parent', 'severe_housing', 'food_index', 'mental_health_days', 'mental_distress_rate', 'median_inc', 'pop_provider_ratio')

# plot histograms
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(x=value)) + 
    facet_wrap(~ key, scales = "free") + 
    geom_histogram(bins=50, color = "#033C5A") +
  labs(title = "Histograms for Select Numeric Variables")

To complement the histograms, we also graphed Q-Q plots to assess the normality of our data. These plots compare the theoretical and sample distributions for a variable at specific quantiles. The line on each graph estimates what the variable would look like with a normal distribution.

In general, the mental health variables do not diverge too far from normality. Also, inequality and median_inc tightly follow a normal distribution until the upper range of their values, reflecting the greater spread that exists at the high end of the distribution.

# plot qq-norm plots
ranked[keep_var] %>% 
  gather() %>% 
  ggplot(aes(sample=value), na.rm=T) + 
    facet_wrap(~ key, scales = "free") + 
    stat_qq(color = "#033C5A") + stat_qq_line() + labs(title = "Q-Q Plots for Select Numeric Variables", y = "Sample Quantiles", x = "Theoretical Quantiles")

Despite not having a perfectly normal distribution, the sample quantiles remain relatively close to the theoretical quantiles for many of our variables. Ultimately, even if none of our variables are perfectly “normal,” few would seem to pose substantial issues during modeling. inequality and median_inc have right-tailed distributions, but without very thick tails.

Correlation Matrix

We also assessed the correlation among the numeric variables. This helps us establish whether variables are positively or negatively associated with each other, as well as the strength of that relationship.

# Correlation Matrices for numeric data

ranked_numeric <- subset(ranked, select = c("mental_health_days", "mental_distress_rate", "inequality", "median_inc", "hs_grad", "college", "unempl", "child_poverty","single_parent", "severe_housing", "food_index","mh_providers","pop_provider_ratio"))

a <- as.matrix(ranked_numeric)

b <- cor(a, use = "na.or.complete")

#corr_numbers <- corrplot(b, is.corr=TRUE, method="number", title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))

corr_numbers <- corrplot(b, is.corr=TRUE, title="Correlation Matrix for Numeric Vars.",mar=c(0,0,1,0))

In general, the results were not surprising. The two mental health variables are very positively correlated, and both are also positively correlated with inequality. In addition, inequality exhibited the strongest positive association with child_poverty and single_parent, and it had the strongest negative correlation with food_index and median_inc. These results make sense (remember that a higher score on the food index indicates a better food environment).

Linear Models

We assessed two sets of models with different dependent variables. The first set of models used mental_health_days as the dependent variable, while the second set used mental_distress_rate as the dependent variable.

Dependent variable: mental_health_days

We arranged several different combinations of variables to evaluate what different might make a suitable linear model with mental_health_days as the dependent variable. First, we regressed mental_health_days on the independent variable of interest – inequality – and a factor variable indicating the US Census Bureau region. Next, we included the factor variable year to control for unobserved differences over time. Third, we integrated a set of economic variables that also might relate to mental health outcomes. Finally, we added demographic variables to the model.

# v1: base model by region
lm_days_1 <- lm(mental_health_days ~ inequality + region, data = ranked)
summ(lm_days_1, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(4,18460) 1341.28
0.23
Adj. R² 0.23
Est. S.E. t val. p VIF
(Intercept) 2.68 0.03 93.25 0.00 NA
inequality 0.25 0.01 37.33 0.00 1.17
regionNortheast 0.21 0.02 11.35 0.00 1.17
regionSouth 0.47 0.01 42.89 0.00 1.17
regionWest 0.12 0.01 8.40 0.00 1.17
Standard errors: OLS
# v2: include year as factor variable
lm_days_2 <- lm(mental_health_days ~ inequality + region + year, data = ranked)
summ(lm_days_2, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(9,18455) 1672.85
0.45
Adj. R² 0.45
Est. S.E. t val. p VIF
(Intercept) 2.32 0.03 90.60 0.00 NA
inequality 0.25 0.01 44.51 0.00 1.17
regionNortheast 0.21 0.02 13.44 0.00 1.17
regionSouth 0.47 0.01 50.79 0.00 1.17
regionWest 0.12 0.01 9.83 0.00 1.17
year2017 0.10 0.01 7.73 0.00 1.00
year2018 0.25 0.01 19.45 0.00 1.00
year2019 0.26 0.01 19.53 0.00 1.00
year2020 0.49 0.01 37.50 0.00 1.00
year2021 1.00 0.01 76.30 0.00 1.00
Standard errors: OLS
# v3: include economic vars
lm_days_3 <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + region + year, data = ranked)
summ(lm_days_3, vifs = T)
Observations 15741 (2729 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(18,15722) 1739.73
0.67
Adj. R² 0.67
Est. S.E. t val. p VIF
(Intercept) 3.94 0.08 52.23 0.00 NA
inequality 0.05 0.01 8.49 0.00 1.91
college -0.90 0.04 -21.63 0.00 2.34
hs_grad 0.27 0.05 5.73 0.00 1.42
unempl 6.49 0.23 27.69 0.00 2.18
child_poverty 0.76 0.09 8.34 0.00 6.96
single_parent -0.13 0.05 -2.55 0.01 2.66
severe_housing -0.38 0.09 -4.29 0.00 1.72
food_index -0.02 0.00 -4.96 0.00 2.56
median_inc -0.00 0.00 -28.25 0.00 3.90
pop_provider_ratio -0.00 0.00 -16.86 0.00 1.18
regionNortheast 0.28 0.01 21.77 0.00 2.23
regionSouth 0.27 0.01 31.93 0.00 2.23
regionWest 0.06 0.01 4.81 0.00 2.23
year2017 0.16 0.01 14.71 0.00 1.67
year2018 0.37 0.01 32.97 0.00 1.67
year2019 0.41 0.01 36.07 0.00 1.67
year2020 0.70 0.01 58.55 0.00 1.67
year2021 1.27 0.01 102.28 0.00 1.67
Standard errors: OLS
# v4: include demographic vars
lm_days_4 <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + pop + pct_below18 + pct_female + pct_rural
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + pct_white
                + region + year, data = ranked)
summ(lm_days_4, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(28,15699) 1598.75
0.74
Adj. R² 0.74
Est. S.E. t val. p VIF
(Intercept) 5.86 0.32 18.59 0.00 NA
inequality 0.07 0.01 13.49 0.00 2.01
college -1.64 0.04 -38.79 0.00 3.12
hs_grad 0.15 0.04 3.46 0.00 1.52
unempl 5.19 0.21 24.50 0.00 2.25
child_poverty 1.34 0.08 16.05 0.00 7.54
single_parent 0.39 0.05 7.26 0.00 3.90
severe_housing 0.39 0.09 4.30 0.00 2.17
food_index -0.04 0.00 -8.54 0.00 3.07
median_inc -0.00 0.00 -15.86 0.00 5.51
pop_provider_ratio -0.00 0.00 -6.53 0.00 1.28
pop 0.00 0.00 8.59 0.00 1.49
pct_below18 0.90 0.11 8.16 0.00 1.76
pct_female 3.39 0.16 20.90 0.00 1.47
pct_rural -0.22 0.01 -15.29 0.00 2.40
pct_black -4.78 0.31 -15.32 0.00 273.52
pct_native_am -4.13 0.33 -12.57 0.00 61.64
pct_asian -3.95 0.36 -10.93 0.00 14.94
pct_pacific -12.20 1.05 -11.67 0.00 3.04
pct_hispanic -4.97 0.30 -16.48 0.00 216.22
pct_white -3.65 0.31 -11.62 0.00 516.34
regionNortheast 0.25 0.01 20.91 0.00 3.44
regionSouth 0.32 0.01 38.78 0.00 3.44
regionWest 0.19 0.01 17.58 0.00 3.44
year2017 0.16 0.01 16.51 0.00 1.98
year2018 0.38 0.01 38.05 0.00 1.98
year2019 0.44 0.01 42.46 0.00 1.98
year2020 0.72 0.01 67.04 0.00 1.98
year2021 1.32 0.01 114.25 0.00 1.98
Standard errors: OLS

For the first (base) model, the equations for each region are as follows:

  • regionMidwest: mental_health_days = 2.676 + 0.247 * inequality
  • regionNortheast: mental_health_days = 2.676 + 0.212 + 0.247 * inequality
  • regionSouth: mental_health_days = 2.676 + 0.472 + 0.247 * inequality
  • regionWest: mental_health_days = 2.676 + 0.121 + 0.247 * inequality

Similar to the initial results from our EDA, the Midwest region (which is included in the intercept) has the lowest estimate for mental_health_days and the South has the highest. All the coefficients are statistically significant. Our measure for income inequality is statistically significant at the 0.001 level with a coefficient of 0.247, indicating a positive association with poor mental health days.

After including the year factor variables, the economic variables, and the demographic variables, the coefficients on all variables remained statistically significant. However, the coefficient on inequality declined to 0.072 in the fourth model (which included the most variables). This suggests that the base model exhibited omitted variable bias that the fuller model at least partially addressed. The South still had the largest coefficient among the region factor variables.

In general, The other economic variables demonstrated relationships with mental health that were expected. For instance, higher rates forunempl, child_poverty, single_parent, and severe_housing were all associated with worse mental health. A better food environment, higher median household incomes, and having a larger percentage of the population with some post-secondary education were associated with better mental health. Interestingly, a higher high-school graduation rate was positively associated with poor mental health.

After adding so many different variables, we want to check for multicollinearity. Thus, we also evaluated the VIF for each of the above models.

The majority of VIF results are below 5, which is acceptable. However, several variables have extremely high VIFs – specifically, the demographic variables for ethnic groups. The following model removed pct_white, which improved the VIFs significantly. It also removed child_poverty, which had a relatively high VIF of 7.536.

# v5: remove vars to reduce vif
lm_days_5 <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic
                + region + year, data = ranked)
summ(lm_days_5, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(25,15702) 1623.59
0.72
Adj. R² 0.72
Est. S.E. t val. p VIF
(Intercept) 2.39 0.09 25.39 0.00 NA
inequality 0.12 0.01 22.51 0.00 1.81
college -2.19 0.04 -55.69 0.00 2.52
hs_grad -0.04 0.04 -0.80 0.42 1.50
unempl 6.37 0.21 30.12 0.00 2.08
single_parent 0.92 0.05 17.60 0.00 3.51
severe_housing 0.47 0.09 5.09 0.00 2.16
food_index -0.10 0.00 -25.17 0.00 2.45
pop 0.00 0.00 9.14 0.00 1.49
pop_provider_ratio -0.00 0.00 -7.84 0.00 1.27
pct_female 4.17 0.17 25.11 0.00 1.43
pct_rural -0.13 0.01 -9.32 0.00 2.23
pct_below18 0.48 0.11 4.25 0.00 1.71
pct_black -1.35 0.04 -38.57 0.00 3.22
pct_native_am -0.65 0.06 -11.14 0.00 1.79
pct_asian -1.13 0.14 -7.81 0.00 2.23
pct_pacific -1.84 0.75 -2.47 0.01 1.45
pct_hispanic -1.43 0.03 -48.60 0.00 1.91
regionNortheast 0.18 0.01 15.24 0.00 3.02
regionSouth 0.32 0.01 39.22 0.00 3.02
regionWest 0.14 0.01 12.27 0.00 3.02
year2017 0.16 0.01 15.98 0.00 1.89
year2018 0.39 0.01 37.62 0.00 1.89
year2019 0.44 0.01 41.84 0.00 1.89
year2020 0.72 0.01 65.10 0.00 1.89
year2021 1.33 0.01 112.53 0.00 1.89
Standard errors: OLS

To permit the coefficients to differ by region, we included an interaction term between inequality and region.

# v6: base model by region with interaction terms
lm_days_6 <- lm(mental_health_days ~ inequality*region, data = ranked)
summ(lm_days_6)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(7,18457) 787.92
0.23
Adj. R² 0.23
Est. S.E. t val. p
(Intercept) 2.36 0.06 40.88 0.00
inequality 0.32 0.01 23.45 0.00
regionNortheast 1.61 0.13 12.29 0.00
regionSouth 0.80 0.07 11.17 0.00
regionWest 0.41 0.10 4.20 0.00
inequality:regionNortheast -0.31 0.03 -10.80 0.00
inequality:regionSouth -0.08 0.02 -4.83 0.00
inequality:regionWest -0.07 0.02 -3.10 0.00
Standard errors: OLS

The equations for each region with interaction terms are as follows:

  • regionMidwest: mental_health_days = 2.3639 + 0.3214 * inequality
  • regionNortheast: mental_health_days = 2.3639 + 1.6126 + (0.3214 - 0.3143) * inequality
  • regionSouth: mental_health_days = 2.3639 + 0.8029 + (0.3214 - 0.0786) * inequality
  • regionWest: mental_health_days = 2.3639 + 0.4104 + (0.3214 - 0.0694) * inequality

These results suggest that the strength of the relationship between mental_health_days and inequality differs by region. In fact, while the Northeast has the highest estimated amount of poor mental health (see the intercept), the coefficient on inequality is near zero. Conversely, the Midwest has the lowest estimate of poor mental health, while exhibiting the strongest association. All the independent variables are statistically significant.

# v7: larger model with interaction terms
lm_days_7 <- lm(mental_health_days ~ inequality*region + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic
                + year, data = ranked)
summ(lm_days_7)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(28,15699) 1460.05
0.72
Adj. R² 0.72
Est. S.E. t val. p
(Intercept) 2.13 0.10 20.84 0.00
inequality 0.18 0.01 17.78 0.00
regionNortheast 0.45 0.08 5.52 0.00
regionSouth 0.73 0.05 14.72 0.00
regionWest 0.19 0.06 2.86 0.00
college -2.19 0.04 -55.86 0.00
hs_grad -0.03 0.04 -0.68 0.50
unempl 6.39 0.21 30.26 0.00
single_parent 0.92 0.05 17.50 0.00
severe_housing 0.44 0.09 4.74 0.00
food_index -0.10 0.00 -24.87 0.00
pop 0.00 0.00 8.84 0.00
pop_provider_ratio -0.00 0.00 -8.03 0.00
pct_female 4.06 0.17 24.40 0.00
pct_rural -0.12 0.01 -8.54 0.00
pct_below18 0.63 0.11 5.50 0.00
pct_black -1.31 0.04 -36.96 0.00
pct_native_am -0.76 0.06 -12.65 0.00
pct_asian -1.25 0.15 -8.59 0.00
pct_pacific -1.53 0.75 -2.05 0.04
pct_hispanic -1.44 0.03 -48.51 0.00
year2017 0.16 0.01 16.04 0.00
year2018 0.39 0.01 37.72 0.00
year2019 0.44 0.01 41.95 0.00
year2020 0.72 0.01 65.34 0.00
year2021 1.33 0.01 112.82 0.00
inequality:regionNortheast -0.06 0.02 -3.40 0.00
inequality:regionSouth -0.09 0.01 -8.26 0.00
inequality:regionWest -0.01 0.01 -0.81 0.42
Standard errors: OLS
vif(lm_days_7)
            inequality            regionNortheast 
                  6.91                      61.76 
           regionSouth                 regionWest 
                 78.81                      62.04 
               college                    hs_grad 
                  2.52                       1.50 
                unempl              single_parent 
                  2.09                       3.52 
        severe_housing                 food_index 
                  2.17                       2.47 
                   pop         pop_provider_ratio 
                  1.50                       1.28 
            pct_female                  pct_rural 
                  1.44                       2.25 
           pct_below18                  pct_black 
                  1.77                       3.29 
         pct_native_am                  pct_asian 
                  1.91                       2.27 
           pct_pacific               pct_hispanic 
                  1.45                       1.95 
              year2017                   year2018 
                  1.74                       1.82 
              year2019                   year2020 
                  2.07                       2.24 
              year2021 inequality:regionNortheast 
                  2.36                      65.24 
inequality:regionSouth      inequality:regionWest 
                 99.90                      66.10 

The equations for the fuller models with interaction terms are below, where Vc is a vector of additional control variables:

  • regionMidwest: mental_health_days = 2.13 + 0.182 * inequality + Vc
  • regionNortheast: mental_health_days = 2.13 + 0.448 + (0.182 - 0.0616) * inequality + Vc
  • regionSouth: mental_health_days = 2.13 + 0.734 + (0.182 - 0.0944) * inequality + Vc
  • regionWest: mental_health_days = 2.13 + 0.186 + (0.182 - 0.0121) * inequality + Vc

After including the set of control variables, the coefficients for inequality still differ by region. However, now it appears that the South has the weakest association between inequality and mental_health_days and the Midwest still has the strongest.

The following table shows the regression results for models 1, 4, 5, and 7 (from above).

export_summs(lm_days_1, lm_days_4, lm_days_5, lm_days_7,
             #error_format = "({p.value})",
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "4", "5", "7"), 
             coefs = c("(Intercept)", "inequality", 
                       "Northeast" = "regionNortheast", "South" = "regionSouth", "West" = "regionWest", 
                       "inequality:NE" = "inequality:regionNortheast", 
                       "inequality:SO" = "inequality:regionSouth", 
                       "inequality:WE" = "inequality:regionWest",
                       "college", "hs_grad", "unempl", "child_poverty", "single_parent", 
                        "severe_housing", "food_index", "median_inc")
             )
1457
(Intercept)2.68 ***5.86 ***2.39 ***2.13 ***
(0.03)   (0.32)   (0.09)   (0.10)   
inequality0.25 ***0.07 ***0.12 ***0.18 ***
(0.01)   (0.01)   (0.01)   (0.01)   
Northeast0.21 ***0.25 ***0.18 ***0.45 ***
(0.02)   (0.01)   (0.01)   (0.08)   
South0.47 ***0.32 ***0.32 ***0.73 ***
(0.01)   (0.01)   (0.01)   (0.05)   
West0.12 ***0.19 ***0.14 ***0.19 ** 
(0.01)   (0.01)   (0.01)   (0.06)   
inequality:NE                     -0.06 ***
                     (0.02)   
inequality:SO                     -0.09 ***
                     (0.01)   
inequality:WE                     -0.01    
                     (0.01)   
college       -1.64 ***-2.19 ***-2.19 ***
       (0.04)   (0.04)   (0.04)   
hs_grad       0.15 ***-0.04    -0.03    
       (0.04)   (0.04)   (0.04)   
unempl       5.19 ***6.37 ***6.39 ***
       (0.21)   (0.21)   (0.21)   
child_poverty       1.34 ***              
       (0.08)                 
single_parent       0.39 ***0.92 ***0.92 ***
       (0.05)   (0.05)   (0.05)   
severe_housing       0.39 ***0.47 ***0.44 ***
       (0.09)   (0.09)   (0.09)   
food_index       -0.04 ***-0.10 ***-0.10 ***
       (0.00)   (0.00)   (0.00)   
median_inc       -0.00 ***              
       (0.00)                 
Num.obs18465       15728       15728       15728       
R20.23    0.74    0.72    0.72    
adj.R20.23    0.74    0.72    0.72    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Dependent variable: mental_distress_rate

Assigned to: Xuan My Code is from line 532 to line 979. My code on lines 532 to 642 refers to Mark’s, and I followed his advice. His code is beautiful, for which I really appreciate it. Thanks Mark!

In the following models, I used different combinations of variables to find the most suitable model.

# x1: base model by region
lm_rate_1 <- lm(mental_distress_rate ~ inequality + region, data = ranked)
summary(lm_rate_1)

Call: lm(formula = mental_distress_rate ~ inequality + region, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.06811 -0.01442 -0.00187 0.01192 0.12544

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.073481 0.000975 75.33 < 2e-16 inequality 0.010104 0.000225 45.00 < 2e-16 regionNortheast 0.001054 0.000636 1.66 0.098 .
regionSouth 0.014139 0.000374 37.84 < 2e-16 regionWest 0.002611 0.000491 5.32 1.1e-07 — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.0207 on 18460 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.241, Adjusted R-squared: 0.241 F-statistic: 1.47e+03 on 4 and 18460 DF, p-value: <2e-16

# x2: include year as factor variable
lm_rate_2 <- lm(mental_distress_rate ~ inequality + region + year, data = ranked)
summary(lm_rate_2)

Call: lm(formula = mental_distress_rate ~ inequality + region + year, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.07870 -0.01157 -0.00085 0.01112 0.10787

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.059954 0.000818 73.30 < 2e-16 inequality 0.010153 0.000178 57.08 < 2e-16 regionNortheast 0.001042 0.000504 2.07 0.039 *
regionSouth 0.014113 0.000296 47.68 < 2e-16 regionWest 0.002550 0.000389 6.56 5.7e-11 year2017 0.004211 0.000418 10.07 < 2e-16 year2018 0.009621 0.000418 23.01 < 2e-16 year2019 0.009662 0.000418 23.12 < 2e-16 year2020 0.017566 0.000418 42.03 < 2e-16 year2021 0.038843 0.000418 92.94 < 2e-16 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.0164 on 18455 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.524, Adjusted R-squared: 0.524 F-statistic: 2.26e+03 on 9 and 18455 DF, p-value: <2e-16

# x3: include economic vars
lm_rate_3 <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + region + year, data = ranked)
summary(lm_rate_3)

Call: lm(formula = mental_distress_rate ~ inequality + college + hs_grad + unempl + child_poverty + single_parent + severe_housing + food_index + median_inc + pop_provider_ratio + region + year, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.04796 -0.00754 -0.00022 0.00767 0.06003

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.25e-01 2.16e-03 57.82 < 2e-16 inequality 1.94e-03 1.69e-04 11.50 < 2e-16 college -3.04e-02 1.19e-03 -25.61 < 2e-16 hs_grad 7.21e-03 1.34e-03 5.37 7.8e-08 unempl 1.83e-01 6.70e-03 27.35 < 2e-16 child_poverty 3.91e-02 2.60e-03 15.01 < 2e-16 single_parent -4.19e-03 1.43e-03 -2.93 0.0034 ** severe_housing 5.42e-03 2.54e-03 2.13 0.0331 *
food_index -1.62e-03 1.30e-04 -12.51 < 2e-16 median_inc -4.11e-07 1.26e-08 -32.68 < 2e-16 pop_provider_ratio -4.94e-07 3.32e-08 -14.87 < 2e-16 regionNortheast 4.77e-03 3.73e-04 12.80 < 2e-16 regionSouth 6.90e-03 2.43e-04 28.41 < 2e-16 regionWest -1.05e-05 3.33e-04 -0.03 0.9748
year2017 6.08e-03 3.16e-04 19.26 < 2e-16
year2018 1.33e-02 3.21e-04 41.57 < 2e-16 year2019 1.51e-02 3.27e-04 46.23 < 2e-16 year2020 2.48e-02 3.40e-04 73.00 < 2e-16 year2021 4.83e-02 3.55e-04 135.98 < 2e-16 — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.0111 on 15722 degrees of freedom (2729 observations deleted due to missingness) Multiple R-squared: 0.771, Adjusted R-squared: 0.771 F-statistic: 2.95e+03 on 18 and 15722 DF, p-value: <2e-16

# x4: include demographic vars
lm_rate_4 <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + pop_provider_ratio
                + pop + pct_below18 + pct_female + pct_rural
                + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + pct_white
                + region + year, data = ranked)
summary(lm_rate_4)

Call: lm(formula = mental_distress_rate ~ inequality + college + hs_grad + unempl + child_poverty + single_parent + severe_housing + food_index + median_inc + pop_provider_ratio + pop + pct_below18 + pct_female + pct_rural + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + pct_white + region + year, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.03983 -0.00654 -0.00004 0.00656 0.05788

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.58e-01 9.05e-03 17.41 < 2e-16 inequality 2.74e-03 1.53e-04 17.91 < 2e-16 college -4.72e-02 1.21e-03 -38.82 < 2e-16 hs_grad 5.26e-03 1.23e-03 4.28 1.9e-05 unempl 1.38e-01 6.08e-03 22.69 < 2e-16 child_poverty 5.34e-02 2.40e-03 22.28 < 2e-16 single_parent 1.04e-02 1.53e-03 6.82 9.6e-12 severe_housing 2.28e-02 2.58e-03 8.86 < 2e-16 food_index -1.60e-03 1.26e-04 -12.71 < 2e-16 median_inc -2.93e-07 1.32e-08 -22.19 < 2e-16 pop_provider_ratio -1.99e-07 3.06e-08 -6.50 8.1e-11 pop 1.54e-09 2.65e-10 5.82 6.0e-09 pct_below18 4.84e-02 3.16e-03 15.30 < 2e-16 pct_female 9.65e-02 4.65e-03 20.74 < 2e-16 pct_rural -3.93e-03 4.10e-04 -9.57 < 2e-16 pct_black -1.22e-01 8.94e-03 -13.65 < 2e-16 pct_native_am -9.17e-02 9.42e-03 -9.74 < 2e-16 pct_asian -9.99e-02 1.04e-02 -9.64 < 2e-16 pct_pacific -3.25e-01 3.00e-02 -10.83 < 2e-16 pct_hispanic -1.30e-01 8.65e-03 -14.99 < 2e-16 pct_white -9.38e-02 9.01e-03 -10.41 < 2e-16 regionNortheast 4.86e-03 3.41e-04 14.24 < 2e-16 regionSouth 8.65e-03 2.37e-04 36.50 < 2e-16 regionWest 3.86e-03 3.12e-04 12.38 < 2e-16 year2017 5.94e-03 2.80e-04 21.24 < 2e-16 year2018 1.33e-02 2.86e-04 46.55 < 2e-16 year2019 1.53e-02 2.94e-04 51.97 < 2e-16 year2020 2.50e-02 3.07e-04 81.19 < 2e-16 year2021 4.94e-02 3.32e-04 148.82 < 2e-16 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.00977 on 15699 degrees of freedom (2742 observations deleted due to missingness) Multiple R-squared: 0.821, Adjusted R-squared: 0.82 F-statistic: 2.57e+03 on 28 and 15699 DF, p-value: <2e-16

#From the first model, we can see that the equations for each region are below:
# regionMidwest: mental_distress_rate = 0.073 + 0.010 * inequality
# regionNortheast: mental_distress_rate = 0.073 + 0.001 + 0.010 * inequality
# regionSouth: mental_distress_rate = 0.073 + 0.014 + 0.010 * inequality
# regionWest: mental_distress_rate = 0.073 + 0.002 + 0.010 * inequality
# From above, we can see that the Midwest has the lowest estimate for mental_distress_rate, and the South has the highest estimate. The northeast region shows not significant, others are significant.
# After including the year factor variables, the economic variables, and the demographic variables, the coefficients on all variables remained significant. The northeast continue to show not significant. However, the coefficient on 'inequality' declined from 0.0101 to 0.00274 in the fourth model, which suggests that the base model exhibited omitted variable bias that the fuller model at least partially addressed. The South region still has the largest coefficient among the 'region' factor variables.
#After adding so many different variables, we need to check for multicollinearity. Thus, we also evaluated the VIF for each of the above models.

summ(lm_rate_1, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(4,18460) 1465.34
0.24
Adj. R² 0.24
Est. S.E. t val. p VIF
(Intercept) 0.07 0.00 75.33 0.00 NA
inequality 0.01 0.00 45.00 0.00 1.17
regionNortheast 0.00 0.00 1.66 0.10 1.17
regionSouth 0.01 0.00 37.84 0.00 1.17
regionWest 0.00 0.00 5.32 0.00 1.17
Standard errors: OLS
summ(lm_rate_2, vifs = T)
Observations 18465 (5 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(9,18455) 2255.52
0.52
Adj. R² 0.52
Est. S.E. t val. p VIF
(Intercept) 0.06 0.00 73.30 0.00 NA
inequality 0.01 0.00 57.08 0.00 1.17
regionNortheast 0.00 0.00 2.07 0.04 1.17
regionSouth 0.01 0.00 47.68 0.00 1.17
regionWest 0.00 0.00 6.56 0.00 1.17
year2017 0.00 0.00 10.07 0.00 1.00
year2018 0.01 0.00 23.01 0.00 1.00
year2019 0.01 0.00 23.12 0.00 1.00
year2020 0.02 0.00 42.03 0.00 1.00
year2021 0.04 0.00 92.94 0.00 1.00
Standard errors: OLS
summ(lm_rate_3, vifs = T)
Observations 15741 (2729 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(18,15722) 2945.24
0.77
Adj. R² 0.77
Est. S.E. t val. p VIF
(Intercept) 0.12 0.00 57.82 0.00 NA
inequality 0.00 0.00 11.50 0.00 1.91
college -0.03 0.00 -25.61 0.00 2.34
hs_grad 0.01 0.00 5.37 0.00 1.42
unempl 0.18 0.01 27.35 0.00 2.18
child_poverty 0.04 0.00 15.01 0.00 6.96
single_parent -0.00 0.00 -2.93 0.00 2.66
severe_housing 0.01 0.00 2.13 0.03 1.72
food_index -0.00 0.00 -12.51 0.00 2.56
median_inc -0.00 0.00 -32.68 0.00 3.90
pop_provider_ratio -0.00 0.00 -14.87 0.00 1.18
regionNortheast 0.00 0.00 12.80 0.00 2.23
regionSouth 0.01 0.00 28.41 0.00 2.23
regionWest -0.00 0.00 -0.03 0.97 2.23
year2017 0.01 0.00 19.26 0.00 1.67
year2018 0.01 0.00 41.57 0.00 1.67
year2019 0.02 0.00 46.23 0.00 1.67
year2020 0.02 0.00 73.00 0.00 1.67
year2021 0.05 0.00 135.98 0.00 1.67
Standard errors: OLS
summ(lm_rate_4, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(28,15699) 2565.65
0.82
Adj. R² 0.82
Est. S.E. t val. p VIF
(Intercept) 0.16 0.01 17.41 0.00 NA
inequality 0.00 0.00 17.91 0.00 2.01
college -0.05 0.00 -38.82 0.00 3.12
hs_grad 0.01 0.00 4.28 0.00 1.52
unempl 0.14 0.01 22.69 0.00 2.25
child_poverty 0.05 0.00 22.28 0.00 7.54
single_parent 0.01 0.00 6.82 0.00 3.90
severe_housing 0.02 0.00 8.86 0.00 2.17
food_index -0.00 0.00 -12.71 0.00 3.07
median_inc -0.00 0.00 -22.19 0.00 5.51
pop_provider_ratio -0.00 0.00 -6.50 0.00 1.28
pop 0.00 0.00 5.82 0.00 1.49
pct_below18 0.05 0.00 15.30 0.00 1.76
pct_female 0.10 0.00 20.74 0.00 1.47
pct_rural -0.00 0.00 -9.57 0.00 2.40
pct_black -0.12 0.01 -13.65 0.00 273.52
pct_native_am -0.09 0.01 -9.74 0.00 61.64
pct_asian -0.10 0.01 -9.64 0.00 14.94
pct_pacific -0.32 0.03 -10.83 0.00 3.04
pct_hispanic -0.13 0.01 -14.99 0.00 216.22
pct_white -0.09 0.01 -10.41 0.00 516.34
regionNortheast 0.00 0.00 14.24 0.00 3.44
regionSouth 0.01 0.00 36.50 0.00 3.44
regionWest 0.00 0.00 12.38 0.00 3.44
year2017 0.01 0.00 21.24 0.00 1.98
year2018 0.01 0.00 46.55 0.00 1.98
year2019 0.02 0.00 51.97 0.00 1.98
year2020 0.02 0.00 81.19 0.00 1.98
year2021 0.05 0.00 148.82 0.00 1.98
Standard errors: OLS
# The majority of VIF results are below 5, which is acceptable. However, several variables have extremely high VIFs – specifically, such as 'child_poverty', 'pct_black', 'pct_white' and 'pct_hispanic'. We would remove these in the following model.
# x5: remove vars to reduce vif
lm_rate_5 <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_native_am + pct_asian + pct_pacific 
                + region + year, data = ranked)
summary(lm_rate_5)

Call: lm(formula = mental_distress_rate ~ inequality + college + hs_grad + unempl + single_parent + severe_housing + food_index + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18 + pct_native_am + pct_asian + pct_pacific + region + year, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.05266 -0.00735 -0.00003 0.00740 0.05017

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.69e-02 2.97e-03 19.14 < 2e-16 inequality 4.18e-03 1.66e-04 25.22 < 2e-16 college -6.95e-02 1.20e-03 -57.91 < 2e-16 hs_grad 4.49e-03 1.37e-03 3.28 0.0011 unempl 1.87e-01 6.69e-03 27.94 < 2e-16 single_parent 4.00e-03 1.42e-03 2.82 0.0048 severe_housing -7.69e-03 2.85e-03 -2.70 0.0069 ** food_index -3.82e-03 1.25e-04 -30.52 < 2e-16 pop 6.72e-10 3.02e-10 2.22 0.0261
pop_provider_ratio -4.76e-07 3.46e-08 -13.75 < 2e-16
pct_female 1.88e-01 5.09e-03 36.87 < 2e-16 pct_rural 3.48e-03 4.34e-04 8.02 1.1e-15 pct_below18 -3.62e-02 3.24e-03 -11.17 < 2e-16 pct_native_am 2.58e-02 1.65e-03 15.59 < 2e-16 pct_asian -5.54e-02 4.56e-03 -12.16 < 2e-16 pct_pacific 7.58e-02 2.35e-02 3.22 0.0013 regionNortheast 1.52e-03 3.83e-04 3.98 6.9e-05 regionSouth 4.73e-03 2.44e-04 19.41 < 2e-16 regionWest -8.13e-04 3.40e-04 -2.39 0.0170 *
year2017 5.63e-03 3.19e-04 17.64 < 2e-16 year2018 1.30e-02 3.26e-04 39.88 < 2e-16 year2019 1.42e-02 3.34e-04 42.55 < 2e-16 year2020 2.31e-02 3.46e-04 66.71 < 2e-16 year2021 4.55e-02 3.57e-04 127.31 < 2e-16 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.0112 on 15704 degrees of freedom (2742 observations deleted due to missingness) Multiple R-squared: 0.765, Adjusted R-squared: 0.765 F-statistic: 2.23e+03 on 23 and 15704 DF, p-value: <2e-16

summ(lm_rate_5, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_distress_rate
Type OLS linear regression
F(23,15704) 2225.58
0.77
Adj. R² 0.76
Est. S.E. t val. p VIF
(Intercept) 0.06 0.00 19.14 0.00 NA
inequality 0.00 0.00 25.22 0.00 1.80
college -0.07 0.00 -57.91 0.00 2.33
hs_grad 0.00 0.00 3.28 0.00 1.44
unempl 0.19 0.01 27.94 0.00 2.08
single_parent 0.00 0.00 2.82 0.00 2.56
severe_housing -0.01 0.00 -2.70 0.01 2.02
food_index -0.00 0.00 -30.52 0.00 2.30
pop 0.00 0.00 2.22 0.03 1.48
pop_provider_ratio -0.00 0.00 -13.75 0.00 1.25
pct_female 0.19 0.01 36.87 0.00 1.34
pct_rural 0.00 0.00 8.02 0.00 2.06
pct_below18 -0.04 0.00 -11.17 0.00 1.41
pct_native_am 0.03 0.00 15.59 0.00 1.45
pct_asian -0.06 0.00 -12.16 0.00 2.21
pct_pacific 0.08 0.02 3.22 0.00 1.43
regionNortheast 0.00 0.00 3.98 0.00 2.40
regionSouth 0.00 0.00 19.41 0.00 2.40
regionWest -0.00 0.00 -2.39 0.02 2.40
year2017 0.01 0.00 17.64 0.00 1.68
year2018 0.01 0.00 39.88 0.00 1.68
year2019 0.01 0.00 42.55 0.00 1.68
year2020 0.02 0.00 66.71 0.00 1.68
year2021 0.05 0.00 127.31 0.00 1.68
Standard errors: OLS
## To permit the coefficients to differ by region, we included an interaction term between `inequality` and `region`.
# x6: base model by region with interaction terms
lm_rate_6 <- lm(mental_distress_rate ~ inequality*region, data = ranked)
summary(lm_rate_6)

Call: lm(formula = mental_distress_rate ~ inequality * region, data = ranked)

Residuals: Min 1Q Median 3Q Max -0.06887 -0.01432 -0.00196 0.01193 0.12541

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.064421 0.001965 32.79 < 2e-16 inequality 0.012271 0.000466 26.36 < 2e-16 regionNortheast 0.050482 0.004457 11.33 < 2e-16 regionSouth 0.022052 0.002441 9.03 < 2e-16 regionWest 0.013255 0.003322 3.99 6.6e-05 inequality:regionNortheast -0.011059 0.000989 -11.18 < 2e-16 inequality:regionSouth -0.001928 0.000553 -3.49 0.00049 inequality:regionWest -0.002526 0.000761 -3.32 0.00090 — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.0206 on 18457 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.246, Adjusted R-squared: 0.246 F-statistic: 861 on 7 and 18457 DF, p-value: <2e-16

#From the model x6, we can see that the equations for each region are below:
# regionMidwest: mental_distress_rate = 0.064421 + 0.012271 * inequality
# regionNortheast: mental_distress_rate = 0.064421 + 0.050482 + (0.012271 - 0.011059) * inequality
# regionSouth: mental_distress_rate = 0.064421 + 0.022052 + (0.012271 - 0.001928) * inequality
# regionWest: mental_distress_rate = 0.064421 + 0.013255 + (0.012271 - 0.002526) * inequality
# From the above equations, we can see that the strength of the relationship between 'mental_distress_rate' and 'inequality' differs by region. In fact, while the Northeast has the highest estimated amount of mental distress rate (because the intercerpt of northeast region is '0.114903'). The coefficient on 'inequality' of northeast region is '0.001212' near 0. Conversely, the midwest region has the lowest estimate of mental distress rate (0.064421), while exhibiting the strongest association. All the independent variables are statistically significant. 
# x7: larger model with interaction terms
lm_rate_7 <- lm(mental_health_days ~ inequality*region + college + hs_grad + unempl
                + single_parent + severe_housing + food_index
                + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18
                + pct_native_am + pct_asian + pct_pacific
                + year, data = ranked)
summary(lm_days_7)

Call: lm(formula = mental_health_days ~ inequality * region + college + hs_grad + unempl + single_parent + severe_housing + food_index + pop + pop_provider_ratio + pct_female + pct_rural + pct_below18 + pct_black + pct_native_am + pct_asian + pct_pacific + pct_hispanic + year, data = ranked)

Residuals: Min 1Q Median 3Q Max -1.5712 -0.2374 -0.0007 0.2402 1.7237

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.13e+00 1.02e-01 20.84 < 2e-16 inequality 1.82e-01 1.02e-02 17.78 < 2e-16 regionNortheast 4.48e-01 8.13e-02 5.52 3.5e-08 regionSouth 7.34e-01 4.99e-02 14.72 < 2e-16 regionWest 1.86e-01 6.49e-02 2.86 0.00425 ** college -2.19e+00 3.93e-02 -55.86 < 2e-16 hs_grad -2.99e-02 4.41e-02 -0.68 0.49749
unempl 6.39e+00 2.11e-01 30.26 < 2e-16
single_parent 9.17e-01 5.24e-02 17.50 < 2e-16 severe_housing 4.40e-01 9.30e-02 4.74 2.2e-06 food_index -1.01e-01 4.08e-03 -24.87 < 2e-16 pop 8.46e-08 9.57e-09 8.84 < 2e-16 pop_provider_ratio -8.85e-06 1.10e-06 -8.03 1.0e-15 pct_female 4.06e+00 1.66e-01 24.40 < 2e-16 pct_rural -1.22e-01 1.43e-02 -8.54 < 2e-16 pct_below18 6.30e-01 1.14e-01 5.50 3.8e-08 pct_black -1.31e+00 3.53e-02 -36.96 < 2e-16 pct_native_am -7.55e-01 5.97e-02 -12.65 < 2e-16 pct_asian -1.25e+00 1.46e-01 -8.59 < 2e-16 pct_pacific -1.53e+00 7.47e-01 -2.05 0.04083
pct_hispanic -1.44e+00 2.96e-02 -48.51 < 2e-16
year2017 1.61e-01 1.01e-02 16.04 < 2e-16 year2018 3.88e-01 1.03e-02 37.72 < 2e-16 year2019 4.44e-01 1.06e-02 41.95 < 2e-16 year2020 7.17e-01 1.10e-02 65.34 < 2e-16 year2021 1.33e+00 1.18e-02 112.82 < 2e-16 inequality:regionNortheast -6.16e-02 1.81e-02 -3.40 0.00069 inequality:regionSouth -9.44e-02 1.14e-02 -8.26 < 2e-16 ** inequality:regionWest -1.21e-02 1.49e-02 -0.81 0.41710
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.352 on 15699 degrees of freedom (2742 observations deleted due to missingness) Multiple R-squared: 0.723, Adjusted R-squared: 0.722 F-statistic: 1.46e+03 on 28 and 15699 DF, p-value: <2e-16

summ(lm_days_7, vifs = T)
Observations 15728 (2742 missing obs. deleted)
Dependent variable mental_health_days
Type OLS linear regression
F(28,15699) 1460.05
0.72
Adj. R² 0.72
Est. S.E. t val. p VIF
(Intercept) 2.13 0.10 20.84 0.00 NA
inequality 0.18 0.01 17.78 0.00 6.91
regionNortheast 0.45 0.08 5.52 0.00 142596.16
regionSouth 0.73 0.05 14.72 0.00 142596.16
regionWest 0.19 0.06 2.86 0.00 142596.16
college -2.19 0.04 -55.86 0.00 2.52
hs_grad -0.03 0.04 -0.68 0.50 1.50
unempl 6.39 0.21 30.26 0.00 2.09
single_parent 0.92 0.05 17.50 0.00 3.52
severe_housing 0.44 0.09 4.74 0.00 2.17
food_index -0.10 0.00 -24.87 0.00 2.47
pop 0.00 0.00 8.84 0.00 1.50
pop_provider_ratio -0.00 0.00 -8.03 0.00 1.28
pct_female 4.06 0.17 24.40 0.00 1.44
pct_rural -0.12 0.01 -8.54 0.00 2.25
pct_below18 0.63 0.11 5.50 0.00 1.77
pct_black -1.31 0.04 -36.96 0.00 3.29
pct_native_am -0.76 0.06 -12.65 0.00 1.91
pct_asian -1.25 0.15 -8.59 0.00 2.27
pct_pacific -1.53 0.75 -2.05 0.04 1.45
pct_hispanic -1.44 0.03 -48.51 0.00 1.95
year2017 0.16 0.01 16.04 0.00 1.89
year2018 0.39 0.01 37.72 0.00 1.89
year2019 0.44 0.01 41.95 0.00 1.89
year2020 0.72 0.01 65.34 0.00 1.89
year2021 1.33 0.01 112.82 0.00 1.89
inequality:regionNortheast -0.06 0.02 -3.40 0.00 192245.88
inequality:regionSouth -0.09 0.01 -8.26 0.00 192245.88
inequality:regionWest -0.01 0.01 -0.81 0.42 192245.88
Standard errors: OLS
# The equations for the fuller models with interaction terms are below, where V~c~ is a vector of additional control variables:
# regionMidwest: mental_distress_rate = 2.13 + 0.182 * inequality + V~c~
# regionNortheast: mental_distress_rate = 2.13 + 0.448 + (0.182 - 0.0616) * inequality + V~c~
# regionSouth: mental_distress_rate = 2.13 + 0.734 + (0.01821 - 0.0944) * inequality + V~c~
# regionWest: mental_distress_rate = 2.13 + 0.186 + (0.0182 - 0.0121) * inequality + V~c~

# After including the set of control variables, the coefficients for `inequality` still differ by region. However, now it appears that the South has the weakest association between `inequality` and `mental_distress_rate` and the Midwest still has the strongest.
export_summs(lm_rate_1, lm_rate_4, lm_rate_5, lm_rate_7,
             statistics = c(Num.obs = "nobs", R2 = "r.squared", adj.R2 = "adj.r.squared"), 
             model.names = c("1", "4", "5", "7"), 
             coefs = c("(Intercept)", "inequality", 
                       "Northeast" = "regionNortheast", "South" = "regionSouth", "West" = "regionWest", 
                       "inequality:NE" = "inequality:regionNortheast", 
                       "inequality:SO" = "inequality:regionSouth", 
                       "inequality:WE" = "inequality:regionWest",
                       "college", "hs_grad", "unempl", "child_poverty", "single_parent", 
                        "severe_housing", "food_index", "median_inc")
             )
1457
(Intercept)0.07 ***0.16 ***0.06 ***1.46 ***
(0.00)   (0.01)   (0.00)   (0.11)   
inequality0.01 ***0.00 ***0.00 ***0.21 ***
(0.00)   (0.00)   (0.00)   (0.01)   
Northeast0.00    0.00 ***0.00 ***0.68 ***
(0.00)   (0.00)   (0.00)   (0.09)   
South0.01 ***0.01 ***0.00 ***0.81 ***
(0.00)   (0.00)   (0.00)   (0.05)   
West0.00 ***0.00 ***-0.00 *  0.49 ***
(0.00)   (0.00)   (0.00)   (0.07)   
inequality:NE                     -0.12 ***
                     (0.02)   
inequality:SO                     -0.14 ***
                     (0.01)   
inequality:WE                     -0.11 ***
                     (0.02)   
college       -0.05 ***-0.07 ***-2.17 ***
       (0.00)   (0.00)   (0.04)   
hs_grad       0.01 ***0.00 ** 0.19 ***
       (0.00)   (0.00)   (0.05)   
unempl       0.14 ***0.19 ***6.54 ***
       (0.01)   (0.01)   (0.23)   
child_poverty       0.05 ***              
       (0.00)                 
single_parent       0.01 ***0.00 ** -0.08    
       (0.00)   (0.00)   (0.05)   
severe_housing       0.02 ***-0.01 ** -0.81 ***
       (0.00)   (0.00)   (0.10)   
food_index       -0.00 ***-0.00 ***-0.09 ***
       (0.00)   (0.00)   (0.00)   
median_inc       -0.00 ***              
       (0.00)                 
Num.obs18465       15728       15728       15728       
R20.24    0.82    0.77    0.67    
adj.R20.24    0.82    0.76    0.67    
*** p < 0.001; ** p < 0.01; * p < 0.05.
library(readxl)
data = data.frame(read.csv("data/processed/analytic_data_2016_2021_with_regions.csv"))
newdata=na.omit(data)
a1=lm(mental_distress_rate~inequality,data = newdata)
summary(a1)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a1)

### The regression equation is: mental_distress_rate = 0.012920 * inequality + 0.068436
### The model a1 can explain 16.4% variance.
### The following is the figure of model a1. It can be seen that the fitting effect is not very good, and the points and lines are not collinear. This is because the R-Squared is relatively small and the percentage of variance explained is small, so the model fitting effect is not good, so this graph is actually meaningless. If R-Squared can reach about 90%, then the points and lines of this graph should basically coincide.
### Because the fitting effect of model a1 was not very good (low variance), I made a model a2. In model a2, I add other independent variables, including the median household income (median_inc), the percentage of households with severe housing problems (severe_housing), the unemployment rate (unempl), and the child poverty rate (child_poverty).
a2=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = newdata)
summary(a2)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2)

### It can be seen from the statistical results of model a2 that all independent variables in the model 2 have significant predictive effects, that is, these independent variables (factors) will significantly affect the frequent mental distress rate.
### However, the R-squared of model a2 is still very low, only 35.5%. (In the process of model analysis, in general, the R-squared must reach 0.8 or more so that we can determine that this model is better).
#### Linear regression model of inequality and mental_distress_rate in the U.S. from 2016 to 2021
## Linear regression model of inequality and mental_distress_rate in the United States in 2016
D2016<-subset(newdata,year == "2016")
a2016=lm(mental_distress_rate~inequality,data = D2016)
summary(a2016)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016)

### The regression equation is: mental_distress_rate = 0.014374 * inequality + 0.049295
### The model can explain 25.3% variance.
## Linear regression model of inequality and mental_distress_rate in the United States in 2017
D2017<-subset(newdata,year == "2017")
a2017=lm(mental_distress_rate~inequality,data = D2017)
summary(a2017)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2017)

### The regression equation is: mental_distress_rate = 0.013137 * inequality + 0.058445
### The model can explain 27.4% variance.
### Linear regression model of inequality and mental_distress_rate in the United States in 2018
D2018<-subset(newdata,year == "2018")
a2018=lm(mental_distress_rate~inequality,data = D2018)
summary(a2018)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2018)

### The regression equation is: mental_distress_rate = 0.012085 * inequality + 0.068410
### The model can explain 23.8% variance.
### Linear regression model of inequality and mental_distress_rate in the United States in 2019
D2019<-subset(newdata,year == "2019")
a2019=lm(mental_distress_rate~inequality,data = D2019)
summary(a2019)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2019)

### The regression equation is: mental_distress_rate = 0.012539 * inequality + 0.066072
### The model can explain 24.5% variance.
### Linear regression model of inequality and mental_distress_rate in the United States in 2020
D2020<-subset(newdata,year == "2020")
a2020=lm(mental_distress_rate~inequality,data = D2020)
summary(a2020)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020)

### The regression equation is: mental_distress_rate = 0.01334 * inequality + 0.07046
### The model can explain 26.3% variance.
### Linear regression model of inequality and mental_distress_rate in the United States in 2021
D2021<-subset(newdata,year == "2021")
a2021=lm(mental_distress_rate~inequality,data = D2021)
summary(a2021)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021)

### The regression equation is: mental_distress_rate = 0.012272 * inequality + 0.097132
### The model can explain 15.3% variance.
# From above, we can see that in U.S. from 2016 to 2021, the inequality as independent variable is significant.But the R-squareds are separately 25.3%, 27.4%, 23.8%, 24.5%, 26.3%, and 15.3%, which means that the model is not good.
#### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. from 2016 to 2021
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2016
D2016M<-subset(D2016,region=="Midwest")
a2016M=lm(mental_distress_rate~inequality,data = D2016M)
summary(a2016M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016M)

### The regression equation is: mental_distress_rate = 0.01074 * inequality + 0.05856
### The model can explain 11.2% variance.
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2017
D2017M<-subset(D2017,region=="Midwest")
a2017M=lm(mental_distress_rate~inequality,data = D2017M)
summary(a2017M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2017M)

### The regression equation is: mental_distress_rate = 0.010752 * inequality + 0.063096
### The model can explain 13.7% variance.
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2018
D2018M<-subset(D2018,region=="Midwest")
a2018M=lm(mental_distress_rate~inequality,data = D2018M)
summary(a2018M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2018M)

### The regression equation is: mental_distress_rate = 0.001156 * inequality + 0.06577
### The model can explain 14.9% variance.
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2019
D2019M<-subset(D2019,region=="Midwest")
a2019M=lm(mental_distress_rate~inequality,data = D2019M)
summary(a2019M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2019M)

### The regression equation is: mental_distress_rate = 0.012505 * inequality + 0.061629
### The model can explain 16.4% variance.
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2020
D2020M<-subset(D2020,region=="Midwest")
a2020M=lm(mental_distress_rate~inequality,data = D2020M)
summary(a2020M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020M)

### The regression equation is: mental_distress_rate = 0.012961 * inequality + 0.067486
### The model can explain 16.0% variance.
### Linear regression model of inequality and mental_distress_rate in the Midwest region of U.S. in 2021
D2021M<-subset(D2021,region=="Midwest")
a2021M=lm(mental_distress_rate~inequality,data = D2021M)
summary(a2021M)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021M)

### The regression equation is: mental_distress_rate = 0.01050 * inequality + 0.09908
### The model can explain 6.88% variance.
# From above, we can see that the midwest region of U.S. from 2016 to 2021, the inequality as independent variable is significant.But the R-squareds are separately 11.2%, 13.7%, 14.9%, 16.4%, 16.0%, and 6.88%, which means that the model is not good.
#### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. from 2016 to 2021
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2016
D2016N<-subset(D2016,region=="Northeast")
a2016N=lm(mental_distress_rate~inequality,data = D2016N)
summary(a2016N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The regression equation is: mental_distress_rate = 0.00329 * inequality + 0.09435
### The model can explain 4.18% variance.
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2017
D2017N<-subset(D2017,region=="Northeast")
a2017N=lm(mental_distress_rate~inequality,data = D2017N)
summary(a2017N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The regression equation is: mental_distress_rate = 0.003944 * inequality + 0.093585
### The model can explain 10% variance.
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2018
D2018N<-subset(D2018,region=="Northeast")
a2018N=lm(mental_distress_rate~inequality,data = D2018N)
summary(a2018N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The regression equation is: mental_distress_rate = 0.00146 * inequality + 0.11001
### The model can explain 1.16% variance.
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2019
D2019N<-subset(D2019,region=="Northeast")
a2019N=lm(mental_distress_rate~inequality,data = D2019N)
summary(a2019N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The regression equation is: mental_distress_rate = 0.001385 * inequality + 0.110462
### The model can explain 1.09% variance.
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2020
D2020N<-subset(D2020,region=="Northeast")
a2020N=lm(mental_distress_rate~inequality,data = D2020N)
summary(a2020N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020N)

### The regression equation is: mental_distress_rate = 0.000358 * inequality + 0.122363
### The model can explain 0.0635% variance.
### Linear regression model of inequality and mental_distress_rate in the Northeast region of U.S. in 2021
D2021N<-subset(D2021,region=="Northeast")
a2021N=lm(mental_distress_rate~inequality,data = D2021N)
summary(a2021N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021N)

### The regression equation is: mental_distress_rate = -0.00523 * inequality + 0.16714
### The model can explain 5.04% variance.
# From above, we can see that the northeast region of U.S. from 2016 to 2021, the inequality as independent variable is significant in 2016, 2017 and 2021, is not significant in 2018, 2019 and 2020. And the R-squareds are separately 4.18%, 10%, 1.16%, 1.09%, 0.0635%, and 5.04%, which means that the model is not good. So the models of northeast region in 2018, 2019 and 2020 are meaningless.
#### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. from 2016 to 2021
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2016
D2016N<-subset(D2016,region=="Northeast")
a2016N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2016N)
summary(a2016N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The model can explain 64% variance.
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2017
D2017N<-subset(D2017,region=="Northeast")
a2017N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2017N)
summary(a2017N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The model can explain 72.2% variance.
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2018
D2018N<-subset(D2018,region=="Northeast")
a2018N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2018N)
summary(a2018N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The model can explain 71.3% variance.
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2019
D2019N<-subset(D2019,region=="Northeast")
a2019N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2019N)
summary(a2019N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016N)

### The regression equation is: mental_distress_rate = 0.001385 * inequality + 0.110462
### The model can explain 69.5% variance.
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2020
D2020N<-subset(D2020,region=="Northeast")
a2020N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2020N)
summary(a2020N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020N)

### The regression equation is: mental_distress_rate = 0.000358 * inequality + 0.122363
### The model can explain 67.9% variance.
### Linear regression model of inequality, median_inc, severe_housing, unempl, child_poverty and mental_distress_rate in the Northeast region of U.S. in 2021
D2021N<-subset(D2021,region=="Northeast")
a2021N=lm(mental_distress_rate~inequality+median_inc+severe_housing+unempl+child_poverty,data = D2021N)
summary(a2021N)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021N)

### The model can explain 74.2% variance.
# From above, we can see that the northeast region of U.S. from 2016 to 2021, after add more independent variables, the R-Squared also increase.
#### Linear regression model of inequality and mental_distress_rate in the South region of U.S. from 2016 to 2021
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2016
D2016S<-subset(D2016,region=="South")
a2016S=lm(mental_distress_rate~inequality,data = D2016S)
summary(a2016S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016S)

### The regression equation is: mental_distress_rate = 0.01218 * inequality + 0.06553
### The model can explain 19.5% variance.
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2017
D2017S<-subset(D2017,region=="South")
a2017S=lm(mental_distress_rate~inequality,data = D2017S)
summary(a2017S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2017S)

### The regression equation is: mental_distress_rate = 0.010856 * inequality + 0.074482
### The model can explain 21.4% variance.
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2018
D2018S<-subset(D2018,region=="South")
a2018S=lm(mental_distress_rate~inequality,data = D2018S)
summary(a2018S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2018S)

### The regression equation is: mental_distress_rate = 0.009553 * inequality + 0.085625
### The model can explain 17.4% variance.
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2019
D2019S<-subset(D2019,region=="South")
a2019S=lm(mental_distress_rate~inequality,data = D2019S)
summary(a2019S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2019S)

### The regression equation is: mental_distress_rate = 0.009599 * inequality + 0.085632
### The model can explain 16.9% variance.
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2020
D2020S<-subset(D2020,region=="South")
a2020S=lm(mental_distress_rate~inequality,data = D2020S)
summary(a2020S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020S)

### The regression equation is: mental_distress_rate = 0.010878 * inequality + 0.088040
### The model can explain 21.2% variance.
### Linear regression model of inequality and mental_distress_rate in the South region of U.S. in 2021
D2021S<-subset(D2021,region=="South")
a2021S=lm(mental_distress_rate~inequality,data = D2021S)
summary(a2021S)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021S)

### The regression equation is: mental_distress_rate = 0.010021 * inequality + 0.115671
### The model can explain 13% variance.
# From above, we can see that the south region of U.S. from 2016 to 2021, the inequality as independent variable is significant. But the R-squareds are separately 19.5%, 21.4%, 17.4%, 16.9%, 21.2%, and 13%, which means that the model is not good.
#### Linear regression model of inequality and mental_distress_rate in the West region of U.S. from 2016 to 2021
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2016
D2016W<-subset(D2016,region=="West")
a2016W=lm(mental_distress_rate~inequality,data = D2016W)
summary(a2016W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2016W)

### The regression equation is: mental_distress_rate = 0.01030 * inequality + 0.06330
### The model can explain 18.7% variance.
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2017
D2017W<-subset(D2017,region=="West")
a2017W=lm(mental_distress_rate~inequality,data = D2017W)
summary(a2017W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2017W)

### The regression equation is: mental_distress_rate = 0.01055 * inequality + 0.06715
### The model can explain 21.4% variance.
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2018
D2018W<-subset(D2018,region=="West")
a2018W=lm(mental_distress_rate~inequality,data = D2018W)
summary(a2018W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2018W)

### The regression equation is: mental_distress_rate = 0.01025 * inequality + 0.07286
### The model can explain 20% variance.
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2019
D2019W<-subset(D2019,region=="West")
a2019W=lm(mental_distress_rate~inequality,data = D2019W)
summary(a2019W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2019W)

### The regression equation is: mental_distress_rate = 0.010694 * inequality + 0.070536
### The model can explain 22.9% variance.
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2020
D2020W<-subset(D2020,region=="West")
a2020W=lm(mental_distress_rate~inequality,data = D2020W)
summary(a2020W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2020W)

### The regression equation is: mental_distress_rate = 0.010574 * inequality + 0.078467
### The model can explain 21.7% variance.
### Linear regression model of inequality and mental_distress_rate in the West region of U.S. in 2021
D2021W<-subset(D2021,region=="West")
a2021W=lm(mental_distress_rate~inequality,data = D2021W)
summary(a2021W)
plot(newdata$inequality,newdata$mental_distress_rate,xlab = "inequality",ylab = "mental_distress_rate")
abline(a2021W)

### The regression equation is: mental_distress_rate = 0.00829 * inequality + 0.10403
### The model can explain 9.3% variance.
# From above, we can see that the west region of U.S. from 2016 to 2021, the inequality as independent variable is significant. But the R-squareds are separately 18.7%, 21.4%, 20%, 22.9%, 21.7%, and 9.3%, which means that the model is not good.

Comparing the Dependent Variables – Xuan

The results come from mental health days and mental distress rate are same. All of the results of the models have statistical meaning. Our data analysis models, whether by year or by region, are just linear equations in one variable. To make our model more convincing, we added more independent variables. When we want to permit the coefficients to differ by region, we included an interaction term between inequality and region. Both Mark and Xuan found that while the Northeast has the highest estimated amount of mental health days and mental distress rate. The coefficient on ‘inequality’ of northeast region is near 0. Conversely, the midwest region has the lowest estimate of mental health days and mental distress rate, while exhibiting the strongest association. And after including the set of control variables, the coefficients for inequality and mental health days or mental distress rate differ by region. However, now it appears that the South has the weakest association between inequality and mental_distress_rate and the Midwest still has the strongest. All of them happens in the two dependet variables.

Feature Selection and Linear Model Evaluation – Shumel

One thing we can do right away is to clean up the data a little. Remove variables that we know for sure are useless. In our case, let us remove some cumulative player stats to make the dataframe more manageable.

#rankedfsf = ranked[ -c(1:9) ] # cleaned datasetnndf <- nndf1[c(1:23)]
rankedfs = ranked [c(10:33)] 
head(rankedfs)

Details

Since this function returns separate best models of all sizes up to nvmax and since different model selection criteria such as AIC, BIC, CIC, DIC, … differ only in how models of different sizes are compared, the results do not depend on the choice of cost-complexity tradeoff. When x is a biglm object it is assumed to be the full model, so force.out is not relevant. If there is an intercept it is forced in by default; specify a force.in as a logical vector with FALSE as the first element to allow the intercept to be dropped. The model search does not actually fit each model, so the returned object does not contain coefficients or standard errors. Coefficients and the variance-covariance matrix for one or model models can be obtained with the coef and vcov methods.

Target Variable: mental_health_days

1. Perform Linear Regression with All Predictors

#Before selecting the best subset of predictors for our regression, let’s run a simple linear regression on our dataset with all predictors to set the base adjusted r² for comparison.

lm1 <- lm(rankedfs,formula=mental_health_days ~. -mental_distress_rate)
summary(lm1)
## 
## Call:
## lm(formula = mental_health_days ~ . - mental_distress_rate, data = rankedfs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3919 -0.3232 -0.0201  0.3089  2.2005 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.14e+00   4.51e-01   15.84  < 2e-16 ***
## inequality          1.49e-01   7.50e-03   19.87  < 2e-16 ***
## median_inc          2.26e-06   6.38e-07    3.54  0.00040 ***
## hs_grad             6.10e-01   5.95e-02   10.26  < 2e-16 ***
## college            -2.68e+00   5.87e-02  -45.71  < 2e-16 ***
## unempl             -4.32e+00   2.67e-01  -16.16  < 2e-16 ***
## child_poverty       2.88e+00   1.16e-01   24.73  < 2e-16 ***
## single_parent      -1.71e+00   6.77e-02  -25.28  < 2e-16 ***
## severe_housing     -4.23e-01   1.26e-01   -3.35  0.00081 ***
## food_index         -3.71e-02   6.10e-03   -6.08  1.2e-09 ***
## mh_providers        3.62e+01   3.09e+00   11.71  < 2e-16 ***
## pop_provider_ratio -1.56e-05   1.55e-06  -10.10  < 2e-16 ***
## pop                 6.98e-08   1.30e-08    5.36  8.4e-08 ***
## pct_below18        -1.99e+00   1.52e-01  -13.13  < 2e-16 ***
## pct_black          -5.04e+00   4.46e-01  -11.30  < 2e-16 ***
## pct_native_am      -4.20e+00   4.68e-01   -8.98  < 2e-16 ***
## pct_asian          -7.42e+00   5.12e-01  -14.50  < 2e-16 ***
## pct_pacific        -9.93e+00   1.49e+00   -6.67  2.6e-11 ***
## pct_hispanic       -5.86e+00   4.31e-01  -13.59  < 2e-16 ***
## pct_white          -5.08e+00   4.49e-01  -11.32  < 2e-16 ***
## pct_female          6.28e+00   2.26e-01   27.83  < 2e-16 ***
## pct_rural          -2.19e-01   2.02e-02  -10.86  < 2e-16 ***
## region_abbNE        1.59e-01   1.68e-02    9.44  < 2e-16 ***
## region_abbS         1.76e-01   1.15e-02   15.30  < 2e-16 ***
## region_abbW         1.58e-01   1.55e-02   10.16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 15703 degrees of freedom
##   (2742 observations deleted due to missingness)
## Multiple R-squared:  0.485,  Adjusted R-squared:  0.485 
## F-statistic:  617 on 24 and 15703 DF,  p-value: <2e-16
# which all feature selection methods we can perform:
library(ISLR)
library(leaps)
rankedfs <- na.omit(rankedfs)
regfit.full = regsubsets(mental_health_days ~.-mental_distress_rate,rankedfs, nvmax=23)
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
# Feature Selection: mental_health_days
library(leaps)
reg.best10 <- regsubsets(mental_health_days~. - mental_distress_rate , data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
                             
reg.best10                   
## Subset selection object
## Call: regsubsets.formula(mental_health_days ~ . - mental_distress_rate, 
##     data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
## 25 Variables  (and intercept)
##                    Forced in Forced out
## inequality             FALSE      FALSE
## median_inc             FALSE      FALSE
## hs_grad                FALSE      FALSE
## college                FALSE      FALSE
## unempl                 FALSE      FALSE
## child_poverty          FALSE      FALSE
## single_parent          FALSE      FALSE
## severe_housing         FALSE      FALSE
## food_index             FALSE      FALSE
## mh_providers           FALSE      FALSE
## pop_provider_ratio     FALSE      FALSE
## pop                    FALSE      FALSE
## pct_below18            FALSE      FALSE
## pct_black              FALSE      FALSE
## pct_native_am          FALSE      FALSE
## pct_asian              FALSE      FALSE
## pct_pacific            FALSE      FALSE
## pct_hispanic           FALSE      FALSE
## pct_white              FALSE      FALSE
## pct_female             FALSE      FALSE
## pct_rural              FALSE      FALSE
## region_abbMW           FALSE      FALSE
## region_abbNE           FALSE      FALSE
## region_abbS            FALSE      FALSE
## region_abbW            FALSE      FALSE
## 1 subsets of each size up to 24
## Selection Algorithm: exhaustive
summary.out <- summary(reg.best10)
as.data.frame(summary.out$outmat)

Best model at each variable number

The best model in the 10-variable case includes all variables, as that is the only way to have 10 variables.

plot(reg.best10, scale = "adjr2", main = "Adjusted R^2")

plot(reg.best10, scale = "r2", main = "R^2")

plot(reg.best10, scale = "bic", main = "BIC")

plot(reg.best10, scale = "Cp", main = "Cp")

coef(reg.best10, 10, scale = "adjr2") # default BIC
##        (Intercept)         inequality            college             unempl 
##           2.20e+00           1.57e-01          -2.51e+00          -4.49e+00 
##      child_poverty      single_parent       mh_providers pop_provider_ratio 
##           2.80e+00          -1.55e+00           4.59e+01          -1.95e-05 
##       pct_hispanic         pct_female       region_abbMW 
##          -1.12e+00           5.68e+00          -1.91e-01
summary.out$which[10,]
##        (Intercept)         inequality         median_inc            hs_grad 
##               TRUE               TRUE              FALSE              FALSE 
##            college             unempl      child_poverty      single_parent 
##               TRUE               TRUE               TRUE               TRUE 
##     severe_housing         food_index       mh_providers pop_provider_ratio 
##              FALSE              FALSE               TRUE               TRUE 
##                pop        pct_below18          pct_black      pct_native_am 
##              FALSE              FALSE              FALSE              FALSE 
##          pct_asian        pct_pacific       pct_hispanic          pct_white 
##              FALSE              FALSE               TRUE              FALSE 
##         pct_female          pct_rural       region_abbMW       region_abbNE 
##               TRUE              FALSE               TRUE              FALSE 
##        region_abbS        region_abbW 
##              FALSE              FALSE

2. Perform Linear Regression with selected variables after feature selection

lm2 <- lm(rankedfs,formula=mental_health_days ~ child_poverty + inequality + unempl + college + mh_providers + pop_provider_ratio + region_abb + single_parent + pct_hispanic + pct_female)
summary(lm2)
## 
## Call:
## lm(formula = mental_health_days ~ child_poverty + inequality + 
##     unempl + college + mh_providers + pop_provider_ratio + region_abb + 
##     single_parent + pct_hispanic + pct_female, data = rankedfs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9949 -0.3438 -0.0244  0.3178  2.1867 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.12e+00   1.02e-01   20.72   <2e-16 ***
## child_poverty       2.73e+00   9.27e-02   29.40   <2e-16 ***
## inequality          1.51e-01   7.28e-03   20.79   <2e-16 ***
## unempl             -4.09e+00   2.64e-01  -15.50   <2e-16 ***
## college            -2.47e+00   5.47e-02  -45.05   <2e-16 ***
## mh_providers        5.09e+01   3.00e+00   16.98   <2e-16 ***
## pop_provider_ratio -2.05e-05   1.55e-06  -13.25   <2e-16 ***
## region_abbNE        1.62e-01   1.61e-02   10.02   <2e-16 ***
## region_abbS         2.18e-01   1.05e-02   20.70   <2e-16 ***
## region_abbW         1.33e-01   1.45e-02    9.17   <2e-16 ***
## single_parent      -1.57e+00   5.65e-02  -27.81   <2e-16 ***
## pct_hispanic       -1.08e+00   3.18e-02  -33.96   <2e-16 ***
## pct_female          5.45e+00   2.16e-01   25.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 15715 degrees of freedom
## Multiple R-squared:  0.461,  Adjusted R-squared:  0.461 
## F-statistic: 1.12e+03 on 12 and 15715 DF,  p-value: <2e-16

Plot Output from regsubsets Function in leaps package

This is just another way of presenting the same information for adjusted \(R^2\).

Mallow Cp is used to decide on the number of predictors to include. The stopping rule is to start with the smallest model and gradually increase number of variables, and stop when Mallow Cp is approximately (number of regressors + 1, broken line) for the first time.

library(car)
library(carData)
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")

## Mallow Cp
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2)

#BIC
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="bic", legend = FALSE, min.size = 5, main = "BIC")
abline(a = 1, b = 1, lty = 2)

#r2
res.legend <-
    subsets(regsubsets(mental_health_days~. -mental_distress_rate , data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="rsq", legend = FALSE, min.size = 5, main = "R square")
abline(a = 1, b = 1, lty = 2)

res.legend
##                    Abbreviation
## inequality                    i
## median_inc                  md_
## hs_grad                       h
## college                      cl
## unempl                        u
## child_poverty                c_
## single_parent               sn_
## severe_housing              sv_
## food_index                    f
## mh_providers                mh_
## pop_provider_ratio         pp__
## pop                          pp
## pct_below18                 p_1
## pct_black                 pct_b
## pct_native_am              pc__
## pct_asian                 pct_s
## pct_pacific               pct_p
## pct_hispanic              pct_h
## pct_white                 pct_w
## pct_female                pct_f
## pct_rural                 pct_r
## region_abbMW                r_M
## region_abbNE                r_N
## region_abbS                 r_S
## region_abbW                 r_W
# See which model has the highest R Square :
which.max(summary.out$rsq)
## [1] 24

The model with 24 variables has the highest RSquare. Variables marked with TRUE are the ones chosen.

plot(summary.out$adjr2,xlab="number of variables", ylab="R Square", type="l")
points(24, summary.out$rsq[24], col='red', cex=2, pch=20)

# See which model has the highest adjusted R2 :
which.max(summary.out$adjr2)
## [1] 23

The model with 24 variables has the highest adjusted \(R^2\). Variables marked with TRUE are the ones chosen.

plot(summary.out$adjr2,xlab="number of variables", ylab="adjr2", type="l")
points(24, summary.out$adjr2[24], col='red', cex=2, pch=20)

#See which model has the lowest CP :
which.min(summary.out$cp)
## [1] 22

The model with 24 variables has the lowest CP.

plot(summary.out$cp,xlab="number of variables", ylab="cp", type="l")
points(24, summary.out$cp[24], col='red', cex=2, pch=20)

#See which model has the lowest BIC :
which.min(summary.out$bic)
## [1] 22

The model with 24 variables has the lowest CP.

plot(summary.out$bic,xlab="number of variables", ylab="BIC", type="l")
points(24, summary.out$bic[24], col='red', cex=2, pch=20)

#forward and backward selection is same as exhaustive just slight different in forward it start will no variable and iterate till the max variable and backward selection is otherway around.
#validation
set.seed(1)
train=sample(c(TRUE, FALSE), nrow(rankedfs), rep=T)
test = (!train)
regfit.best=regsubsets(mental_health_days~.-mental_distress_rate ,data=rankedfs[train,],nvmax = 23)
test.mat = model.matrix(mental_health_days~.-mental_distress_rate , data = rankedfs[test,])
val.errors = rep(NA, 23)
for(i in 1:23){
  coefi = coef(regfit.best, id=i)
   pred = test.mat[,names(coefi)]%*%coefi
   val.errors[i]= mean((rankedfs$mental_health_days[test]-pred)^2)
   }
val.errors
##  [1] 0.327 0.311 0.292 0.279 0.271 0.265 0.256 0.251 0.246 0.243 0.242 0.241
## [13] 0.238 0.237 0.235 0.234 0.234 0.232 0.232 0.232 0.232 0.231 0.231
which.min(val.errors)
## [1] 22
coef(regfit.best,23)
##        (Intercept)         inequality         median_inc            hs_grad 
##           6.52e+00           1.52e-01           2.05e-06           5.91e-01 
##            college             unempl      child_poverty      single_parent 
##          -2.69e+00          -3.96e+00           2.68e+00          -1.70e+00 
##     severe_housing         food_index       mh_providers pop_provider_ratio 
##          -3.53e-01          -4.29e-02           3.84e+01          -1.66e-05 
##                pop        pct_below18          pct_black      pct_native_am 
##           7.13e-08          -2.02e+00          -4.23e+00          -3.33e+00 
##          pct_asian        pct_pacific       pct_hispanic          pct_white 
##          -6.52e+00          -8.41e+00          -5.08e+00          -4.24e+00 
##         pct_female          pct_rural       region_abbMW        region_abbS 
##           6.32e+00          -1.90e-01          -1.54e-01           3.00e-02

Target Variable: mental_distress_rate

1.Perform Linear Regression with All Predictors

#1.Perform Linear Regression with All Predictors
#Before selecting the best subset of predictors for our regression, let’s run a simple linear regression on our dataset with all predictors to set the base adjusted r² for comparison.

lm3 <- lm(rankedfs,formula=mental_distress_rate ~. -mental_health_days )
summary(lm3)
## 
## Call:
## lm(formula = mental_distress_rate ~ . - mental_health_days, data = rankedfs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08588 -0.01023 -0.00131  0.00901  0.07827 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.11e-01   1.50e-02   14.06  < 2e-16 ***
## inequality          5.66e-03   2.50e-04   22.69  < 2e-16 ***
## median_inc          6.03e-08   2.12e-08    2.84   0.0045 ** 
## hs_grad             2.09e-02   1.98e-03   10.56  < 2e-16 ***
## college            -8.58e-02   1.95e-03  -43.94  < 2e-16 ***
## unempl             -2.07e-01   8.88e-03  -23.33  < 2e-16 ***
## child_poverty       1.11e-01   3.87e-03   28.67  < 2e-16 ***
## single_parent      -7.13e-02   2.25e-03  -31.66  < 2e-16 ***
## severe_housing     -6.85e-03   4.20e-03   -1.63   0.1031    
## food_index         -1.67e-03   2.03e-04   -8.23  < 2e-16 ***
## mh_providers        1.11e+00   1.03e-01   10.75  < 2e-16 ***
## pop_provider_ratio -5.44e-07   5.15e-08  -10.57  < 2e-16 ***
## pop                 1.14e-09   4.33e-10    2.64   0.0083 ** 
## pct_below18        -6.11e-02   5.05e-03  -12.08  < 2e-16 ***
## pct_black          -1.36e-01   1.48e-02   -9.14  < 2e-16 ***
## pct_native_am      -9.82e-02   1.56e-02   -6.31  2.9e-10 ***
## pct_asian          -2.33e-01   1.70e-02  -13.69  < 2e-16 ***
## pct_pacific        -2.49e-01   4.95e-02   -5.04  4.8e-07 ***
## pct_hispanic       -1.67e-01   1.43e-02  -11.64  < 2e-16 ***
## pct_white          -1.51e-01   1.49e-02  -10.12  < 2e-16 ***
## pct_female          2.06e-01   7.51e-03   27.50  < 2e-16 ***
## pct_rural          -4.51e-03   6.71e-04   -6.72  1.9e-11 ***
## region_abbNE        1.69e-03   5.60e-04    3.02   0.0025 ** 
## region_abbS         3.24e-03   3.83e-04    8.47  < 2e-16 ***
## region_abbW         2.59e-03   5.16e-04    5.03  5.0e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0159 on 15703 degrees of freedom
## Multiple R-squared:  0.522,  Adjusted R-squared:  0.521 
## F-statistic:  714 on 24 and 15703 DF,  p-value: <2e-16
# feature selection(mental distress rate)
library(leaps)
reg2.best10 <- regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
                             
reg2.best10                   
## Subset selection object
## Call: regsubsets.formula(mental_distress_rate ~ . - mental_health_days, 
##     data = rankedfs, nvmax = NULL, nbest = 1, method = "exhaustive")
## 25 Variables  (and intercept)
##                    Forced in Forced out
## inequality             FALSE      FALSE
## median_inc             FALSE      FALSE
## hs_grad                FALSE      FALSE
## college                FALSE      FALSE
## unempl                 FALSE      FALSE
## child_poverty          FALSE      FALSE
## single_parent          FALSE      FALSE
## severe_housing         FALSE      FALSE
## food_index             FALSE      FALSE
## mh_providers           FALSE      FALSE
## pop_provider_ratio     FALSE      FALSE
## pop                    FALSE      FALSE
## pct_below18            FALSE      FALSE
## pct_black              FALSE      FALSE
## pct_native_am          FALSE      FALSE
## pct_asian              FALSE      FALSE
## pct_pacific            FALSE      FALSE
## pct_hispanic           FALSE      FALSE
## pct_white              FALSE      FALSE
## pct_female             FALSE      FALSE
## pct_rural              FALSE      FALSE
## region_abbMW           FALSE      FALSE
## region_abbNE           FALSE      FALSE
## region_abbS            FALSE      FALSE
## region_abbW            FALSE      FALSE
## 1 subsets of each size up to 24
## Selection Algorithm: exhaustive
summary2.out <- summary(reg2.best10)
as.data.frame(summary2.out$outmat)
plot(reg2.best10, scale = "adjr2", main = "Adjusted R^2")

plot(reg2.best10, scale = "r2", main = "R^2")

plot(reg2.best10, scale = "bic", main = "BIC")

plot(reg2.best10, scale = "Cp", main = "Cp")

coef(reg2.best10, 10, scale = "adjr2") # default BIC
##   (Intercept)    inequality       college        unempl child_poverty 
##       0.05396       0.00597      -0.07621      -0.20307       0.11434 
## single_parent  mh_providers pct_native_am  pct_hispanic    pct_female 
##      -0.05768       1.86752       0.04196      -0.02817       0.17954 
##   region_abbS 
##       0.00491
summary2.out$which[10,]
##        (Intercept)         inequality         median_inc            hs_grad 
##               TRUE               TRUE              FALSE              FALSE 
##            college             unempl      child_poverty      single_parent 
##               TRUE               TRUE               TRUE               TRUE 
##     severe_housing         food_index       mh_providers pop_provider_ratio 
##              FALSE              FALSE               TRUE              FALSE 
##                pop        pct_below18          pct_black      pct_native_am 
##              FALSE              FALSE              FALSE               TRUE 
##          pct_asian        pct_pacific       pct_hispanic          pct_white 
##              FALSE              FALSE               TRUE              FALSE 
##         pct_female          pct_rural       region_abbMW       region_abbNE 
##               TRUE              FALSE              FALSE              FALSE 
##        region_abbS        region_abbW 
##               TRUE              FALSE

2.Perform Linear Regression with selected variables after feature selection

lm4 <- lm(rankedfs,formula=mental_distress_rate ~  inequality + college + mh_providers + pct_female + child_poverty  + unempl + single_parent + region_abb + pct_hispanic + pct_native_am)
summary(lm4)
## 
## Call:
## lm(formula = mental_distress_rate ~ inequality + college + mh_providers + 
##     pct_female + child_poverty + unempl + single_parent + region_abb + 
##     pct_hispanic + pct_native_am, data = rankedfs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06968 -0.01070 -0.00149  0.00914  0.07660 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.053433   0.003393   15.75  < 2e-16 ***
## inequality     0.005854   0.000243   24.10  < 2e-16 ***
## college       -0.075564   0.001817  -41.58  < 2e-16 ***
## mh_providers   1.739600   0.095220   18.27  < 2e-16 ***
## pct_female     0.179805   0.007196   24.99  < 2e-16 ***
## child_poverty  0.116025   0.003090   37.54  < 2e-16 ***
## unempl        -0.210209   0.008833  -23.80  < 2e-16 ***
## single_parent -0.057428   0.001887  -30.43  < 2e-16 ***
## region_abbNE   0.002034   0.000539    3.77  0.00016 ***
## region_abbS    0.005652   0.000351   16.10  < 2e-16 ***
## region_abbW    0.001965   0.000487    4.04  5.4e-05 ***
## pct_hispanic  -0.029361   0.001060  -27.70  < 2e-16 ***
## pct_native_am  0.041526   0.002153   19.28  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0163 on 15715 degrees of freedom
## Multiple R-squared:  0.498,  Adjusted R-squared:  0.497 
## F-statistic: 1.3e+03 on 12 and 15715 DF,  p-value: <2e-16

Plot Output from regsubsets Function in leaps package

This is just another way of presenting the same information for adjusted \(R^2\).

Mallow Cp is used to decide on the number of predictors to include. The stopping rule is to start with the smallest model and gradually increase number of variables, and stop when Mallow Cp is approximately (number of regressors + 1, broken line) for the first time.

library(car)
library(carData)
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")

## Mallow Cp
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2)

#BIC
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="bic", legend = FALSE, min.size = 5, main = "BIC")
abline(a = 1, b = 1, lty = 2)

#r2
res.legend <-
    subsets(regsubsets(mental_distress_rate~. -mental_health_days, data = rankedfs, nvmax = 10, nbest = 1, method = "exhaustive"), statistic="rsq", legend = FALSE, min.size = 5, main = "R square")
abline(a = 1, b = 1, lty = 2)

res.legend
##                    Abbreviation
## inequality                    i
## median_inc                  md_
## hs_grad                       h
## college                      cl
## unempl                        u
## child_poverty                c_
## single_parent               sn_
## severe_housing              sv_
## food_index                    f
## mh_providers                mh_
## pop_provider_ratio         pp__
## pop                          pp
## pct_below18                 p_1
## pct_black                 pct_b
## pct_native_am              pc__
## pct_asian                 pct_s
## pct_pacific               pct_p
## pct_hispanic              pct_h
## pct_white                 pct_w
## pct_female                pct_f
## pct_rural                 pct_r
## region_abbMW                r_M
## region_abbNE                r_N
## region_abbS                 r_S
## region_abbW                 r_W
#See which model has the highest R Square :
which.max(summary2.out$rsq)
## [1] 24

The model with 24 variables has the highest RSquare. Variables marked with TRUE are the ones chosen.

plot(summary2.out$adjr2,xlab="number of variables", ylab="R Square", type="l")
points(24, summary2.out$rsq[24], col='red', cex=2, pch=20)

#See which model has the highest adjusted R2 :
which.max(summary2.out$adjr2)
## [1] 24

The model with 24 variables has the highest adjusted \(R^2\). Variables marked with TRUE are the ones chosen.

plot(summary2.out$adjr2,xlab="number of variables", ylab="adjr2", type="l")
points(24, summary2.out$adjr2[24], col='red', cex=2, pch=20)

#See which model has the lowest CP :
which.min(summary2.out$cp)
## [1] 23

The model with 24 variables has the lowest CP.

plot(summary2.out$cp,xlab="number of variables", ylab="cp", type="l")
points(24, summary2.out$cp[24], col='red', cex=2, pch=20)

#See which model has the lowest BIC :
which.min(summary2.out$bic)
## [1] 19

The model with 21 variables has the lowest CP.

plot(summary2.out$bic,xlab="number of variables", ylab="BIC", type="l")
points(21, summary2.out$bic[21], col='red', cex=2, pch=20)

#forward and backward selection is same as exhaustive just slight different in forward it start will no variable and iterate till the max variable and backward selection is otherway around.
#validation
set.seed(1)
train=sample(c(TRUE, FALSE), nrow(rankedfs), rep=T)
test = (!train)
regfit.best=regsubsets(mental_distress_rate~.-mental_health_days,data=rankedfs[train,],nvmax = 23)
test.mat = model.matrix(mental_distress_rate~.-mental_health_days, data = rankedfs[test,])
val.errors = rep(NA, 23)
for(i in 1:23){
  coefi = coef(regfit.best, id=i)
   pred = test.mat[,names(coefi)]%*%coefi
   val.errors[i]= mean((rankedfs$mental_health_days[test]-pred)^2)
   }
val.errors
##  [1] 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1
## [16] 16.1 16.1 16.1 16.1 16.1 16.1 16.1 16.1
which.min(val.errors)
## [1] 13
coef(regfit.best,23)
##        (Intercept)         inequality         median_inc            hs_grad 
##           1.86e-01           5.57e-03           4.96e-08           2.15e-02 
##            college             unempl      child_poverty      single_parent 
##          -8.58e-02          -1.94e-01           1.04e-01          -7.10e-02 
##         food_index       mh_providers pop_provider_ratio                pop 
##          -1.93e-03           1.14e+00          -5.79e-07           1.13e-09 
##        pct_below18          pct_black      pct_native_am          pct_asian 
##          -6.20e-02          -1.06e-01          -6.75e-02          -1.97e-01 
##        pct_pacific       pct_hispanic          pct_white         pct_female 
##          -2.04e-01          -1.39e-01          -1.21e-01           2.06e-01 
##          pct_rural       region_abbMW       region_abbNE        region_abbS 
##          -3.46e-03          -2.69e-03          -1.19e-03           9.11e-04

ANOVA Comparison of Models

anovaRes_mental_health_days <- anova(lm1,lm2)
anovaRes_mental_distress_rate <- anova(lm3,lm4)

xkabledply(anovaRes_mental_health_days, title = "ANOVA comparison between two linear models for mental health days : lm1 and lm2")
ANOVA comparison between two linear models for mental health days : lm1 and lm2
Res.Df RSS Df Sum of Sq F Pr(>F)
15703 3608 NA NA NA NA
15715 3777 -12 -169 61.4 0
xkabledply(anovaRes_mental_distress_rate, title = "ANOVA comparison between two linear models for mental distress rate : lm3 and lm4")
ANOVA comparison between two linear models for mental distress rate : lm3 and lm4
Res.Df RSS Df Sum of Sq F Pr(>F)
15703 3.99 NA NA NA NA
15715 4.20 -12 -0.204 66.7 0

Regression Tree Models

Regression Tree - mental_health_days

In addition to the linear models above, regression tree models were also experimented with to determine if other model types could improve predicting the prevalence of the mental health variables and variable importance for predicting each mental health variable.

The regression tree model analyzes the relationship between number of poor mental health days versus all the economic inequality variables included in our dataset. Additionally, a region was included as an explanatory variable to observe the impact that a specific region has on the dependent variable. In other words, the model regresses mental_health_days against inequality, median_inc, hs_grad, college, unempl, child_poverty, single_parent, severe_housing, food_index, mh_providers, pop_provider_ratio, region, and year.

The initial regression tree model below has a max depth of 8 and a complexity penalty (cp), or complexity parameter, of 0.004. This is to allow for a large regression tree in order to produce a model that can best fit the data. However, it must be acknowledge that a larger tree will run the risk of over fitting the data making the model worse at predicting the target variable (i.e. the dependent variable).

The following regression tree has 17 leaf nodes.

The summary table of the regression tree indicates the order of variable importance for each variable within the regression. Interestingly, the top three most important variables in order of importance is year, child_pov, and median_inc.

#Regression Tree - Number of Mentally Unhealthy Days

#View(ranked)
#str(ranked)

#Remove NA values.
ranked <- na.omit(ranked)

#Creating Train and Test Set - ranked
set.seed(123)
train = sample(1:nrow(ranked), nrow(ranked)/1.25)

#First Model for Number of Mentally Unhealthy Days (ranked; all vars.)
model_ment_helth_day_tree <- rpart(mental_health_days ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, ranked, subset=train, control = list(maxdepth=8, cp=0.004))

#Summary for Number of Mentally Unhealthy Days
summ_table1=summary(model_ment_helth_day_tree)
table1_FI_tree<-data.frame(summ_table1$variable.importance)
table1_FI_tree %>%
  kbl(caption="Variable Importance: Target Y - mental_health_days (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Variable Importance: Target Y - mental_health_days (all vars.)
Feature Importance
year 1586.83
child_poverty 1502.05
median_inc 1173.68
food_index 615.87
unempl 588.94
college 583.42
single_parent 545.39
inequality 54.16
region 52.43
mh_providers 14.08
hs_grad 13.31
pop_provider_ratio 12.39
severe_housing 7.11
#Plot Regression Tree
#rpart.plot(model_ment_helth_day_tree)
fancyRpartPlot(model_ment_helth_day_tree, main = "Regression Tree: Target Y - mental_health_days (all vars.)")

#Print Complexity Penalty
#printcp(model_ment_helth_day_tree)

#Plotting Complexity Parameters 
par(mar = c(5, 5, 10, 5)) 
plotcp(model_ment_helth_day_tree, minline = TRUE, lty = 3, col = 1, main="Regression Tree: CP Plot")

#R-squared
#rsq.rpart(model_ment_helth_day_tree)

Pruned Regression Tree - mental_health_days

The next regression is a pruned version of the previous regression tree. The regression tree is being pruned to see if a smaller model can improve the prediction capabilities of the model. The cp value is 0.0067 for the pruned regression. This cp value was chosen because in the cp plot of the previous regression shows that cp values above 0.0067 do not drastically improve the model’s X-val relative error. In other words, cp values above 0.0067 do not further drastically minimize the X-val relative error.

The following pruned tree has 13 leaf nodes.

The summary table of the regression tree indicates the order of variable importance for each variable within the regression. Again, the top three most important variables in order of importance is year, child_pov, and median_inc.

#Prune Tree - Number of Mentally Unhealthy Days

#Prune Tree Model
prune_model_ment_helth_day_tree=prune.rpart(model_ment_helth_day_tree, cp=0.0067, best=5)


#Summary of Pruned Tree
summ_table2=summary(prune_model_ment_helth_day_tree)
#summ_table2$variable.importance

table2_FI_tree<-data.frame(summ_table2$variable.importance)
table2_FI_tree %>%
  kbl(caption="Pruned Variable Importance: Target Y - mental_health_days (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Pruned Variable Importance: Target Y - mental_health_days (all vars.)
Feature Importance
year 1579.616
child_poverty 1479.220
median_inc 1143.893
food_index 611.472
college 544.814
single_parent 536.570
unempl 516.602
inequality 53.924
region 52.431
hs_grad 11.096
mh_providers 3.515
pop_provider_ratio 1.827
severe_housing 0.868
#Plot Pruned Tree
fancyRpartPlot(prune_model_ment_helth_day_tree, main = "Pruned Regression Tree: Target Y - mental_health_days (all vars.)")

#Print Complexity Penalty of Prune Tree
#printcp(prune_model_ment_helth_day_tree)

#Plot Complexity Penalty of Prune Tree 
par(mar = c(5, 5, 10, 5)) 
plotcp(prune_model_ment_helth_day_tree,minline = TRUE, lty = 3, col = 1, main="Pruned Regression Tree: CP Plot")

#R-squared
#rsq.rpart(prune_model_ment_helth_day_tree)

Predicting/RMSE for Regression Trees - mental_health_days

The next section is focuses on how good the regression trees are at predicting the target variable. The models will be evaluated using RMSE. The lower the RMSE value the more accurate the model. The RMSE value of original regression tree is 0.427. The RMSE value of the pruned regression tree is 0.435. These RMSE values suggest that the original regression is a slightly better predicting model.

#Predicting and RMSE for Regression Trees

#Prediction/Testing - Original Model
yhat1=predict(model_ment_helth_day_tree,newdata=ranked[-train,])
test1=ranked[-train,"mental_health_days"]

plot(yhat1,test1, main="Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

#Prediction/Testing - Pruned Model
yhat_prune1=predict(prune_model_ment_helth_day_tree,newdata=ranked[-train,])


plot(yhat_prune1,test1, main="Pruned Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

Original Reg. Tree RMSE:

#MSE
#mean((yhat1-test)^2)

#RMSE
sqrt(mean((yhat1-test1)^2))
## [1] 0.427

Pruned Tree RMSE:

#MSE
#mean((yhat_prune1-test)^2)

#RMSE
sqrt(mean((yhat_prune1-test1)^2))
## [1] 0.435

Regression Tree - mental_distress_rate

This section focuses on the mental_distress_rate variable. The regression tree model below analyzes the relationship between the poor mental health distress rate versus all the economic inequality variables included in our dataset. Additionally, a region and year variables were included as explanatory variables to observe the impact that a specific region and year have on predicting the dependent variable. In other words, the model regresses mental_distress_rate against inequality, median_inc, hs_grad, college, unempl, child_poverty, single_parent, severe_housing, food_index, mh_providers, pop_provider_ratio, region, and year.

The initial regression tree model below has a max depth of 8 and a complexity penalty (cp), or complexity parameter, of 0.004. This is to allow for a large regression tree in order to produce a model that can best fit the data. However, it must be acknowledge that a larger tree will run the risk of over fitting the data making the model worse at predicting the target variable (i.e. the dependent variable).

The following regression tree has 16 leaf nodes.

The summary table of the regression tree indicates the order of variable importance for each variable within the regression. Interestingly, the top three most important variables in order of importance is child_pov, year, and median_inc.

#Regression Trees - Mental Health Distress Rate

#Creating Train and Test Set - ranked
set.seed(123)
train = sample(1:nrow(ranked), nrow(ranked)/1.25)

#First Model for Mental Health Distress Rate (ranked; all vars.)
model_ment_dis_rate_tree <- rpart(mental_distress_rate ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, ranked, subset=train, control = list(maxdepth=8, cp=0.004))

#Summary for Number of Mentally Unhealthy Days
summ_table3=summary(model_ment_dis_rate_tree)
#summ_table3$variable.importance

table3_FI_tree<-data.frame(summ_table3$variable.importance)
table3_FI_tree %>%
  kbl(caption="Variable Importance: Target Y - mental_distress_rate (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Variable Importance: Target Y - mental_distress_rate (all vars.)
Feature Importance
child_poverty 2.210
year 2.038
median_inc 1.987
food_index 1.117
college 1.111
single_parent 0.995
unempl 0.803
inequality 0.243
mh_providers 0.036
pop_provider_ratio 0.035
hs_grad 0.015
region 0.012
severe_housing 0.002
#Plot Regression Tree
#rpart.plot(model_ment_dis_rate_tree)
fancyRpartPlot(model_ment_dis_rate_tree, main = "Regression Tree: Target Y - mental_distress_rate (all vars.)")

#Print Complexity Penalty
#printcp(model_ment_dis_rate_tree)


#Plotting Complexity Parameters
par(mar = c(5, 5, 10, 5))
plotcp(model_ment_dis_rate_tree, minline = TRUE, lty = 3, col = 1, main="Regression Tree: CP Plot")

#R-squared
#rsq.rpart(model_ment_dis_rate_tree)

Pruned Regression Tree - mental_distress_rate

The next regression is a pruned version of the previous regression tree. The regression tree is being pruned to see if a smaller model can improve the prediction capabilities of the model. The cp value is 0.001 for the pruned regression. This cp value was chosen because in the cp plot of the previous regression shows that cp values above 0.01 do not drastically improve the model’s X-val relative error. In other words, cp values above 0.01 do not further drastically minimize the X-val relative error.

The following pruned tree has 11 leaf nodes.

The summary table of the regression tree indicates the order of variable importance for each variable within the regression. Again, the top three most important variables in order of importance is child_pov, median_inc, and year.

#Prune Tree - Mental Health Distress Rate 

#Prune Tree Model
prune_model_ment_dis_rate_tree=prune.rpart(model_ment_dis_rate_tree, cp=0.01, best=5)

#Summary of Pruned Tree
summ_table4=summary(prune_model_ment_dis_rate_tree)
#summ_table4$variable.importance
table4_FI_tree<-data.frame(summ_table4$variable.importance)

table4_FI_tree %>%
  kbl(caption="Pruned Variable Importance: Target Y - mental_distress_rate (all vars.) ",
       format= "html", col.names = c("Feature Importance"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
Pruned Variable Importance: Target Y - mental_distress_rate (all vars.)
Feature Importance
child_poverty 2.193
median_inc 1.960
year 1.837
college 1.111
food_index 1.098
single_parent 0.981
unempl 0.751
inequality 0.227
mh_providers 0.035
pop_provider_ratio 0.034
region 0.012
hs_grad 0.006
severe_housing 0.001
#Plot Pruned Tree
fancyRpartPlot(prune_model_ment_dis_rate_tree, main = "Pruned Regression Tree: Target Y - mental_distress_rate (all vars.)")

#Print Complexity Penalty of Prune Tree
#printcp(prune_model_ment_dis_rate_tree)

#Plot Complexity Penalty of Prune Tree 
par(mar = c(5, 5, 10, 5))
plotcp(prune_model_ment_dis_rate_tree,minline = TRUE, lty = 3, col = 1, main="Pruned Regression Tree: CP Plot")

#R-squared
#rsq.rpart(prune_model_ment_dis_rate_tree)

Predicting/RMSE for Regression Trees - mental_distress_rate

The next section is focuses on how good the regression trees are at predicting the target variable. The models will be evaluated using RMSE. The lower the RMSE value the more accurate the model. The RMSE value of original regression tree is 0.0126. The RMSE value of the pruned regression tree is 0.0134. These RMSE values suggest that the original regression is a slightly better predicting model.

#Predicting and RMSE for Regression Trees

#Prediction/Testing - Original Model
yhat2=predict(model_ment_dis_rate_tree,newdata=ranked[-train,])
test2=ranked[-train,"mental_distress_rate"] 

plot(yhat2,test2, main="Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

#Prediction/Testing - Pruned Model
yhat_prune2=predict(prune_model_ment_dis_rate_tree,newdata=ranked[-train,])


plot(yhat_prune2,test2, main="Pruned Regression Tree: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

Original Reg. Tree RMSE:

#MSE
#mean((yhat2-test)^2)

#RMSE
sqrt(mean((yhat2-test2)^2))
## [1] 0.0126

Pruned Tree RMSE:

#MSE
#mean((yhat_prune2-test)^2)

#RMSE
sqrt(mean((yhat_prune2-test2)^2))
## [1] 0.0134

Random Forest Models

Random Forest - Num. of Mentally Unhealthy Days

Finally, random forest models were constructed to further analyze the variable importance and how accurately a different model type can predict the metal health target variables. The first random forest model regresses the number of poor mental health days versus all the economic inequality variables included in our dataset. Additionally, a region and year variables were included as explanatory variables to observe the impact that a specific region and year have on predicting the dependent variable.

The top three most important variables in order of importance by %IncMSE are year, median_inc, and region. The top three most important variables in order of importance by IncNudePurity are year, median_inc, and region. The RMSE value of this random forest is 0.327.

#Random Forest - Num. of Mentally Unhealthy Days

#Generate Random Forest Model - Num. of Mentally Unhealthy Days (all vars.)
set.seed(123)
bag_ment_hlth_days=randomForest(mental_health_days ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, data=ranked, subset=train, mtry=13, importance=TRUE)

#bag_ment_hlth_days

#Predict/Test Random Forest Model
yhat.bag1 = predict(bag_ment_hlth_days,newdata=ranked[-train,])
plot(yhat.bag1, test1, main="Random Forest: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

RMSE:

#MSE
#mean((yhat.bag1-test)^2)

#RMSE
sqrt(mean((yhat.bag1-test1)^2))
## [1] 0.327
#Determine Variable Importance
#importance(bag_ment_hlth_days)

#Plot Variable Importance 
varImpPlot(bag_ment_hlth_days, main = "Random Forest Vars. Importance: Target Y - mental_health_days (all vars.)")

Random Forest - Mental Distress Rate

The second random forest model regresses the number of mental health distress rate versus all the economic inequality variables included in our dataset. Additionally, a region and year variables were included as explanatory variables to observe the impact that a specific region and year have on predicting the dependent variable.

The top three most important variables in order of importance by %IncMSE are year, median_inc, and region. The top three most important variables in order of importance by IncNudePurity are year, child_poverty, and median_inc. The RMSE value of this random forest is 0.00915.

#Random Forest - Mental Distress Rate

#Generate Random Forest Model - Mental Health Distress Rate 
set.seed(123)
bag_ment_dis_rate=randomForest(mental_distress_rate ~ inequality + median_inc + hs_grad + college + unempl + child_poverty + single_parent + severe_housing + food_index + mh_providers + pop_provider_ratio + region + year, data=ranked, subset=train, mtry=13, importance=TRUE)

#bag_ment_dis_rate

#Predict/Test Random Forest Model
yhat.bag2 = predict(bag_ment_dis_rate,newdata=ranked[-train,])
plot(yhat.bag2, test2, main="Random Forest: Test vs. Predicted Values", xlab="Predicted Values", ylab="Test Values")
abline(0,1)

RMSE:

#MSE
#mean((yhat.bag1-test)^2)

#RMSE
sqrt(mean((yhat.bag2-test2)^2))
## [1] 0.00915
#Determine Variable Importance
#importance(bag_ment_dis_rate)

#Plot Variable Importance 
varImpPlot(bag_ment_dis_rate, main = "Random Forest Vars. Importance: Target Y - mental_distress_rate (all vars.)")

Overall, the random forest models were better predicting models relative to the regression tree models as indicated by their lower RMSE values for each respective target mental health variable.

RMSE Comparison for All Models

RMSE Model Comparisons - mental_health_days

Below is RMSE comparisons between all the models used in this project. Each model compares the target variable mental_health_days against the independent variables inequality, college, hs_grad, unempl, child_poverty, single_parent, severe_housing, food_index, median_inc, mh_providers, pop_provider_ratio, region, and year.

Based on the RMSE values the random forest is the best predicting model because it has the lowest RMSE value of 0.327. A low RMSE value indicates that the simulated predicted data is close to the observed test data, indicating better accuracy relative to higher RMSE values.

#RMSE Model Comparisons - mental_health_days

lm_days_train <- lm(mental_health_days ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + mh_providers + pop_provider_ratio
                + region + year, data=ranked, subset=train)

yhat_lm_1 = predict(lm_days_train,newdata=ranked[-train,])

#RMSE: Linear Model
a1<-sqrt(mean((yhat_lm_1-test1)^2))

#RMSE: Regression Tree
b1<-sqrt(mean((yhat1-test1)^2))

#RMSE: Pruned Regression Tree
c1<-sqrt(mean((yhat_prune1-test1)^2))

#RMSE: Random Forest
d1<-sqrt(mean((yhat.bag1-test1)^2))

df1 <- data.frame(Models1 <- c("Linear Model", "Reg. Tree", "Pruned Tree", "Random Forest"), RMSE1 <- c(a1, b1, c1, d1) )

df1 %>%
  kbl(caption="RMSE Comparisons - mental_health_days",
       format= "html", digits = 4, col.names = c("Models","RMSE"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
RMSE Comparisons - mental_health_days
Models RMSE
Linear Model 0.390
Reg. Tree 0.427
Pruned Tree 0.435
Random Forest 0.327

RMSE Model Comparisons - mental_distress_rate

Each model below compares the target variable mental_distress_rate against the independent variables inequality, college, hs_grad, unempl, child_poverty, single_parent, severe_housing, food_index, median_inc, mh_providers, pop_provider_ratio, region, and year.

Based on the RMSE values the random forest is the best predicting model because it has the lowest RMSE value of 0.0091. A low RMSE value indicates that the simulated predicted data is close to the observed test data, indicating better accuracy relative to higher RMSE values.

lm_dist_train <- lm(mental_distress_rate ~ inequality + college + hs_grad + unempl
                + child_poverty + single_parent + severe_housing + food_index
                + median_inc + mh_providers + pop_provider_ratio
                + region + year, data=ranked, subset=train)

yhat_lm_2 = predict(lm_dist_train,newdata=ranked[-train,])

#RMSE: Linear Model
a2<-sqrt(mean((yhat_lm_2-test2)^2))

#RMSE: Regression Tree
b2<-sqrt(mean((yhat2-test2)^2))

#RMSE: Pruned Regression Tree
c2<-sqrt(mean((yhat_prune2-test2)^2))

#RMSE: Random Forest
d2<-sqrt(mean((yhat.bag2-test2)^2))

df2 <- data.frame(Models2 <- c("Linear Model", "Reg. Tree", "Pruned Tree", "Random Forest"), RMSE2 <- c(a2, b2, c2, d2) )

df2 %>%
  kbl(caption="RMSE Comparisons - mental_distress_rate",
       format= "html", digits = 4, col.names = c("Models","RMSE"),
      align="r") %>%
   kable_classic_2(full_width = F, html_font = "helvetica")
RMSE Comparisons - mental_distress_rate
Models RMSE
Linear Model 0.0110
Reg. Tree 0.0126
Pruned Tree 0.0134
Random Forest 0.0091